ZTE-EBASE opened a new pull request, #1226:
URL: https://github.com/apache/cloudberry/pull/1226

   … Data Ingestion
   
   gpfdist is a file distribution program in Cloudberry that can parallel load 
external data into the database. However, it has the drawback that data files 
must reside on the same machine as the tool. Therefore,extending it to support 
the SFTP protocol can address the above drawback and enable loading files from 
a remote server.
   
   <!-- Thank you for your contribution to Apache Cloudberry (Incubating)! -->
   
   Fixes #ISSUE_Number
   
   ### What does this PR do?
   By extending the `gpfdist` tool to support the SFTP protocol, remote data 
loading has been achieved, overcoming the challenge of having the tool and data 
files on the same machine.
   
   ### Type of Change
   New feature (non-breaking change)
   
   ### Test Plan
   <!-- How did you test these changes? -->
   - [ ] Unit tests added/updated
   - [ ] Integration tests added/updated
   - [ ] Passed `make installcheck`
   - [ ] Passed `make -C src/test installcheck-cbdb-parallel`
   
   ### Impact
   <!-- Remove sections that don't apply -->
   **Performance:**
   <!-- Any performance implications? -->
   
   **User-facing changes:**
   <!-- Any changes visible to users? -->
   
   **Dependencies:**
   The ssh2 library needs to be introduced during compilation and placed under 
`/usr/local`.
   
   ### Checklist
   - [ ] Followed [contribution 
guide](https://cloudberry.apache.org/contribute/code)
   - [ ] Added/updated documentation
   - [ ] Reviewed code for security implications
   - [ ] Requested review from [cloudberry 
committers](https://github.com/orgs/apache/teams/cloudberry-committers)
   
   ### Additional Context
   Under this approach, the location template for the external table is:
   ```shell
   CREATE EXTERNAL TABLE ext1 (d varchar(20)) location 
('gpfdist://ip:port/<sftp://sftp-user:passwd@sftp-hostip:sftp-port/file.csv>') 
format 'csv' (DELIMITER '|');
   ``` 
   
   Related Test Case:
   1 Start gpfdist
   ```shell
   [cdbberry@node196 ~]$ gpfdist -d /home/cdbberry/ -p 9876 -l gpfdist.log &
   [1] 83161
   [cdbberry@node196 ~]$ 2025-07-12 14:49:21 83161 INFO Before opening 
listening sockets - following listening sockets are available:
   2025-07-12 14:49:21 83161 INFO IPV6 socket: [::]:9876
   2025-07-12 14:49:21 83161 INFO IPV4 socket: 0.0.0.0:9876
   2025-07-12 14:49:21 83161 INFO Trying to open listening socket:
   2025-07-12 14:49:21 83161 INFO IPV6 socket: [::]:9876
   2025-07-12 14:49:21 83161 INFO Opening listening socket succeeded
   2025-07-12 14:49:21 83161 INFO Trying to open listening socket:
   2025-07-12 14:49:21 83161 INFO IPV4 socket: 0.0.0.0:9876
   2025-07-12 14:49:21 83161 INFO Opening listening socket succeeded
   Serving HTTP on port 9876, directory /home/cdbberry
   ``` 
   
   2 create  table (external)
   ```shell
   CREATE table test(
   id int,
   name varchar(20)
   );
   
   CREATE external table testww(
   id int,
   name varchar(20)
   )
   location 
   ('gpfdist://10.229.89.196:9876/<sftp://xxx:xxxx@xxx:22/xx.csv>')
   format 'csv' (delimiter as '|' NULL as '' FILL MISSING FIELDS) SEGMENT 
REJECT LIMIT 2 ROWS;
   ``` 
   
   3  data load
   ```shell
   insert into test select * from testww;
   ``` 
   
   4  result 
   ```shell
   postgres=# insert into test select * from test_ext;
   INSERT 0 10
   postgres=# select * from test;
    id |   name    
   ----+-----------
     2 | ZTE-EBASE
     3 | ZTE-EBASE
     4 | ZTE-EBASE
     6 | ZTE-EBASE
     7 | ZTE-EBASE
     8 | ZTE-EBASE
     9 | ZTE-EBASE
    10 | ZTE-EBASE
     1 | ZTE-EBASE
     5 | ZTE-EBASE
   (10 rows)
   ``` 
   
   cat test.csv 
   1|ZTE-EBASE
   2|ZTE-EBASE
   3|ZTE-EBASE
   4|ZTE-EBASE
   5|ZTE-EBASE
   6|ZTE-EBASE
   7|ZTE-EBASE
   8|ZTE-EBASE
   9|ZTE-EBASE
   10|ZTE-EBASE
   
   The amount and content of the table data are consistent with the file.
   
   ### CI Skip Instructions
   <!--
   To skip CI builds, add the appropriate CI skip identifier to your PR title.
   The identifier must:
   - Be in square brackets []
   - Include the word "ci" and either "skip" or "no"
   - Only use for documentation-only changes or when absolutely necessary
   -->
   
   ---
   <!-- Join our community:
   - Mailing list: 
[[email protected]](https://lists.apache.org/[email protected])
 (subscribe: [email protected])
   - Discussions: https://github.com/apache/cloudberry/discussions -->
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to