This is an automated email from the ASF dual-hosted git repository.
tomshawn pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/cloudberry-site.git
The following commit(s) were added to refs/heads/main by this push:
new 1f50f52bfb docs: add more baseline docs (loading data related) (#318)
1f50f52bfb is described below
commit 1f50f52bfb5a80aad70b6efa4c46ac445a1e66bc
Author: TomShawn <[email protected]>
AuthorDate: Thu Sep 18 14:29:44 2025 +0800
docs: add more baseline docs (loading data related) (#318)
* docs: add more baseline docs (loading data related)
* bump to 2.x
* Update handle-data-errors.md
---
docs/data-loading/handle-data-errors.md | 331 +++++++++++++++++
docs/data-loading/index.md | 7 +-
docs/data-loading/load-data-from-s3.md | 184 +++++++++
docs/data-loading/load-data-using-gpfdists.md | 262 +++++++++++++
docs/data-loading/load-data-using-pxf.md | 327 ++++++++++++++++
docs/developer/write-a-foreign-data-wrapper.md | 411 +++++++++++++++++++++
sidebars.ts | 33 +-
.../version-2.x/data-loading/handle-data-errors.md | 331 +++++++++++++++++
versioned_docs/version-2.x/data-loading/index.md | 7 +-
.../version-2.x/data-loading/load-data-from-s3.md | 184 +++++++++
.../data-loading/load-data-using-gpfdists.md | 262 +++++++++++++
.../data-loading/load-data-using-pxf.md | 327 ++++++++++++++++
.../developer/write-a-foreign-data-wrapper.md | 411 +++++++++++++++++++++
versioned_sidebars/version-2.x-sidebars.json | 17 +-
14 files changed, 3079 insertions(+), 15 deletions(-)
diff --git a/docs/data-loading/handle-data-errors.md
b/docs/data-loading/handle-data-errors.md
new file mode 100644
index 0000000000..eb42c3689e
--- /dev/null
+++ b/docs/data-loading/handle-data-errors.md
@@ -0,0 +1,331 @@
+---
+title: Handle Data Loading Errors
+---
+
+# Handle Data Loading Errors
+
+Real-world data is often imperfect, containing formatting errors, missing
values, or inconsistent data types. Apache Cloudberry provides robust error
handling mechanisms that allow you to load correctly formatted data while
isolating and managing problematic rows, ensuring your ETL processes are
resilient and reliable.
+
+By default, if external table data contains any error, the entire load
operation fails and no data is loaded. With error handling enabled, you can
load valid data and deal with problematic rows separately.
+
+## Error handling modes
+
+Apache Cloudberry supports two error handling approaches:
+
+1. Single row error isolation allows the system to skip individual problematic
rows and continue processing the remaining valid data, preventing entire
operations from failing due to isolated data quality issues.
+2. Error logging functionality captures comprehensive details about
problematic data rows, including error descriptions, line numbers, and the
actual data that caused the failure, enabling thorough analysis and remediation.
+
+## Single row error isolation
+
+### Basic error isolation
+
+Enable single row error isolation by adding `SEGMENT REJECT LIMIT` to your
external table definition:
+
+```sql
+CREATE EXTERNAL TABLE sales_data_with_errors (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV' (HEADER)
+SEGMENT REJECT LIMIT 100;
+```
+
+This configuration allows up to 100 rows with errors per segment before the
operation fails.
+
+### Percentage-based limits
+
+You can also specify error limits as a percentage:
+
+```sql
+CREATE EXTERNAL TABLE sales_data (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV' (HEADER)
+SEGMENT REJECT LIMIT 5 PERCENT;
+```
+
+This allows up to 5% of rows to contain errors before failing.
+
+## Error logging
+
+### Enable error logging
+
+Use `LOG ERRORS` to capture detailed information about rejected rows:
+
+```sql
+CREATE EXTERNAL TABLE sales_data_logged (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV' (HEADER)
+LOG ERRORS
+SEGMENT REJECT LIMIT 50;
+```
+
+### Persistent error logging
+
+For long-term error analysis, use persistent error logging:
+
+```sql
+CREATE EXTERNAL TABLE sales_data_persistent (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV' (HEADER)
+LOG ERRORS PERSISTENTLY
+SEGMENT REJECT LIMIT 25;
+```
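+
+The rows captured with `LOG ERRORS PERSISTENTLY` are kept until you remove them explicitly. The following is a minimal sketch of reading and clearing the persistent log, assuming the Greenplum-derived helper functions `gp_read_persistent_error_log()` and `gp_truncate_persistent_error_log()` are available in your build:
+
+```sql
+-- Reads rows captured by LOG ERRORS PERSISTENTLY (assumed helper function).
+SELECT cmdtime, filename, linenum, errmsg
+FROM gp_read_persistent_error_log('sales_data_persistent');
+
+-- Clears the persisted error rows after review (assumed helper function).
+SELECT gp_truncate_persistent_error_log('sales_data_persistent');
+```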
+
+## View error information
+
+### Query error logs
+
+When error logging is enabled, Apache Cloudberry creates error log tables that
you can query:
+
+```sql
+-- Views recent errors from the current session.
+SELECT * FROM gp_read_error_log('sales_data_logged');
+```
+
+### Error log table structure
+
+The error log contains these columns:
+
+| Column | Description |
+|--------|-------------|
+| `cmdtime` | Timestamp when the error occurred |
+| `relname` | Name of the external table |
+| `filename` | Source file containing the error |
+| `linenum` | Line number in the source file |
+| `bytenum` | Byte position in the source file |
+| `errmsg` | Error message description |
+| `rawdata` | Raw data that caused the error |
+| `rawbytes` | Raw bytes of the problematic data |
+
+### Example error analysis
+
+```sql
+-- Find the most common error types
+SELECT errmsg, COUNT(*) as error_count
+FROM gp_read_error_log('sales_data_logged')
+GROUP BY errmsg
+ORDER BY error_count DESC;
+
+-- Views specific error details.
+SELECT cmdtime, filename, linenum, errmsg, rawdata
+FROM gp_read_error_log('sales_data_logged')
+WHERE errmsg LIKE '%invalid input syntax%'
+ORDER BY cmdtime DESC;
+```
+
+## Common data errors and solutions
+
+### Data type conversion errors
+
+- **Error**: `invalid input syntax for type numeric`
+- **Cause**: Non-numeric data in numeric columns
+- **Solution**: Clean the data or use text columns with post-processing
+
+```sql
+-- Original problematic data: "N/A" in amount column.
+-- Solution: Use text type and handle conversion later.
+CREATE EXTERNAL TABLE sales_flexible (
+ transaction_id int,
+ product_name text,
+ sale_date text, -- Use text for flexible parsing
+ amount text -- Use text to handle "N/A" values
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV' (HEADER)
+LOG ERRORS SEGMENT REJECT LIMIT 10 PERCENT;
+```
+
+### Date format issues
+
+- **Error**: `invalid input syntax for type date`
+- **Cause**: Inconsistent date formats
+- **Solution**: Standardize date formats or use flexible parsing
+
+```sql
+-- Handles multiple date formats in post-processing.
+SELECT
+ transaction_id,
+ product_name,
+ CASE
+ WHEN sale_date ~ '^\d{4}-\d{2}-\d{2}$' THEN sale_date::date
+ WHEN sale_date ~ '^\d{2}/\d{2}/\d{4}$' THEN to_date(sale_date,
'MM/DD/YYYY')
+ ELSE NULL
+ END as parsed_date,
+ amount::decimal(10,2)
+FROM sales_flexible
+WHERE sale_date IS NOT NULL;
+```
+
+### Missing or extra columns
+
+- **Error**: `extra data after last expected column`
+- **Cause**: Inconsistent number of columns
+- **Solution**: Use a more flexible table definition
+
+```sql
+-- Adds extra optional columns to handle variable column counts.
+CREATE EXTERNAL TABLE flexible_sales (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2),
+ extra_field1 text, -- Optional fields
+ extra_field2 text,
+ extra_field3 text
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV' (HEADER)
+LOG ERRORS SEGMENT REJECT LIMIT 20 PERCENT;
+```
+
+### Character encoding issues
+
+- **Error**: `invalid byte sequence`
+- **Cause**: Character encoding mismatch
+- **Solution**: Specify the correct encoding
+
+```sql
+CREATE EXTERNAL TABLE encoded_data (
+ id int,
+ description text
+)
+LOCATION ('gpfdist://etl-server:8081/data/*.txt')
+FORMAT 'TEXT' (DELIMITER '|')
+ENCODING 'LATIN1' -- Specify encoding
+LOG ERRORS SEGMENT REJECT LIMIT 5 PERCENT;
+```
+
+## Error handling strategies
+
+### Two-phase loading
+
+Use a staging approach for complex data cleaning:
+
+```sql
+-- Phase 1: Load into staging table with flexible types
+CREATE EXTERNAL TABLE sales_staging (
+ transaction_id text,
+ product_name text,
+ sale_date text,
+ amount text,
+ raw_line text -- Store entire row for complex cases
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV'
+LOG ERRORS SEGMENT REJECT LIMIT 20 PERCENT;
+
+-- Phase 2: Clean and insert into final table
+INSERT INTO sales_final (transaction_id, product_name, sale_date, amount)
+SELECT
+ transaction_id::int,
+ product_name,
+ sale_date::date,
+ amount::decimal(10,2)
+FROM sales_staging
+WHERE transaction_id ~ '^\d+$' -- Validate numeric ID
+ AND sale_date ~ '^\d{4}-\d{2}-\d{2}$' -- Validate date format
+ AND amount ~ '^\d+\.?\d*$'; -- Validate amount format
+```
+
+### Error threshold monitoring
+
+Set up monitoring for error rates:
+
+```sql
+-- Function to check error rate
+CREATE OR REPLACE FUNCTION check_error_rate(table_name text, threshold_percent
numeric)
+RETURNS boolean AS $$
+DECLARE
+ error_count int;
+ total_count int;
+ error_rate numeric;
+BEGIN
+ SELECT COUNT(*) INTO error_count FROM gp_read_error_log(table_name);
+
+ -- Estimate total processed rows (depends on your tracking method)
+ SELECT reltuples INTO total_count FROM pg_class WHERE relname = table_name;
+
+ IF total_count > 0 THEN
+ error_rate := (error_count::numeric / total_count) * 100;
+ RETURN error_rate <= threshold_percent;
+ END IF;
+
+ RETURN true;
+END;
+$$ LANGUAGE plpgsql;
+
+-- Usage
+SELECT check_error_rate('sales_data_logged', 5.0); -- Check if error rate is
under 5%
+```
+
+## Best practices
+
+### Design for errors
+
+1. Begin with permissive table definitions using text data types and generous
error limits to understand the full scope of data quality issues before
implementing strict validation rules.
+2. Implement data validation incrementally by gradually adding constraints and
type conversions as you gain confidence in data quality and identify patterns
in the error logs.
+3. Establish regular monitoring and review processes for error logs to
identify systematic data quality issues and trends that may indicate problems
with source systems or data processing pipelines.
+
+### Error limit guidelines
+
+| Data Quality | Suggested Reject Limit | Use Case |
+|--------------|----------------------|----------|
+| High quality | 1-10 rows | Production systems |
+| Medium quality | 1-5% | Development/testing |
+| Poor quality | 10-20% | Initial data exploration |
+| Unknown quality | 50% | Data discovery phase |
+
+### Operational procedures
+
+1. Establish a regular schedule for cleaning up old error logs to prevent excessive storage consumption and maintain system performance (see the sketch after this list).
+2. Configure monitoring systems to alert when error rates exceed predefined
thresholds, enabling quick response to data quality issues.
+3. Maintain communication channels with data providers to share error patterns
and collaborate on improving source data quality.
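+
+A minimal cleanup sketch, assuming the Greenplum-derived `gp_truncate_error_log()` helper function is available in your build:
+
+```sql
+-- Drops the captured error rows for one external table (assumed helper function).
+SELECT gp_truncate_error_log('sales_data_logged');
+```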
+
+### Performance considerations
+
+1. Error logging functionality introduces some computational overhead during
data loading operations, so consider this impact when processing large datasets.
+2. Setting higher segment reject limits allows more rows to be processed
before the operation fails, but this also means more resources are consumed
analyzing problematic data.
+3. Error log tables can grow significantly in production environments, so
implement monitoring to track storage usage and prevent disk space issues.
+
+## Troubleshooting
+
+### High error rates
+
+If you are experiencing high error rates:
+
+1. Analyze the error log entries to identify patterns or systematic issues in
your data, such as consistent formatting problems or missing values in specific
columns.
+2. Work with data providers to verify the quality and consistency of source
data, including checking for recent changes in data formats or processing.
+3. Carefully review your external table definitions to ensure that column data
types, delimiters, and format specifications accurately match the actual data
structure.
+4. Start troubleshooting with small data samples to isolate issues quickly
before processing larger datasets.
+
+### Performance issues
+
+If error handling is impacting performance:
+
+1. Fine-tune your segment reject limits to balance between fault tolerance and
processing efficiency, avoiding unnecessarily high thresholds that waste
resources.
+2. For complex data with known quality issues, consider implementing a
two-phase loading process using staging tables with flexible data types.
+3. Break large data loads into smaller, manageable batches to reduce memory
pressure and improve error isolation.
+4. Continuously monitor system resources including memory usage, disk I/O, and
storage consumption during data loading operations.
+
+## Learn more
+
+- [Load Data Using gpfdist](/docs/data-loading/load-data-using-gpfdist.md)
+- [Load Data Using COPY](/docs/data-loading/load-data-using-copy.md)
diff --git a/docs/data-loading/index.md b/docs/data-loading/index.md
index cdf9b96263..7c575533b7 100644
--- a/docs/data-loading/index.md
+++ b/docs/data-loading/index.md
@@ -23,9 +23,12 @@ Apache Cloudberry offers multiple data loading solutions,
and you can select dif
| -------------------------- |
----------------------------------------------------------- |
------------------------------------------------------------ | -------- |
| [`copy`](/docs/data-loading/load-data-using-copy.md) |
Local file system<br /><br />• Coordinator node host (for a single file)<br />•
Segment node host (for multiple files) | • TXT<br />• CSV<br />• Binary
| No |
| [`file://` protocol](/docs/data-loading/load-data-using-file-protocol.md)
| Local file system (local segment host, accessible only by superuser) | •
TXT<br />• CSV | Yes |
-| [`gpfdist`](/docs/data-loading/load-data-using-gpfdist.md) | Local host
files or files accessible via internal network | • TXT<br />• CSV<br />• Any
delimited text format supported by the `FORMAT` clause<br />• XML and JSON
(requires conversion to text format via YAML configuration file) | Yes |
|
+| [`gpfdist`](/docs/data-loading/load-data-using-gpfdist.md) | Local host
files or files accessible via internal network | • TXT<br />• CSV<br />• Any
delimited text format supported by the `FORMAT` clause<br />• XML and JSON
(requires conversion to text format via YAML configuration file) | Yes |
+| [`gpfdists`](/docs/data-loading/load-data-using-gpfdists.md) (secure
`gpfdist`) | Local host files or files accessible via internal network (with
SSL encryption) | • TXT<br />• CSV<br />• Any delimited text format supported
by the `FORMAT` clause<br />• XML and JSON (requires conversion to text format
via YAML configuration file) | Yes |
| [Batch loading using `gpload`](/docs/data-loading/load-data-using-gpload.md)
(with `gpfdist` as the underlying worker) | Local host files or files
accessible via internal network | • TXT<br />• CSV<br />• Any delimited text
format supported by the `FORMAT` clause<br />• XML and JSON (require conversion
to text format via YAML configuration file) | Yes |
-| [Creating external web
tables](/docs/data-loading/load-data-from-web-services.md) | Data
pulled from network services or from any source accessible by command lines | •
TXT<br />• CSV | Yes |
+| [`s3://` protocol](/docs/data-loading/load-data-from-s3.md) | Amazon
S3 and S3-compatible object stores | • TXT<br />• CSV<br />• Any delimited text
format supported by the `FORMAT` clause
| Yes |
+| [`pxf://` protocol](/docs/data-loading/load-data-using-pxf.md) |
Object stores (S3, Azure, GCS), Hadoop systems (HDFS, Hive, HBase), SQL
databases | • Text<br />• Avro<br />• JSON<br />• Parquet<br />• ORC<br />• CSV
| Yes |
+| [Create external web
tables](/docs/data-loading/load-data-from-web-services.md) | Data
pulled from network services or from any source accessible by command lines | •
TXT<br />• CSV | Yes |
| [Kafka FDW](/docs/data-loading/load-data-from-kafka-using-fdw.md) |
Streaming data from Apache Kafka | • JSON<br />• CSV
| No |
## Learn more
diff --git a/docs/data-loading/load-data-from-s3.md
b/docs/data-loading/load-data-from-s3.md
new file mode 100644
index 0000000000..f9776e151b
--- /dev/null
+++ b/docs/data-loading/load-data-from-s3.md
@@ -0,0 +1,184 @@
+---
+title: Load Data from Amazon S3
+---
+
+# Load Data from Amazon S3 Using the `s3` Protocol
+
+The `s3` protocol is used in a URL that specifies the location of an Amazon S3
bucket and a prefix to use for reading or writing files in the bucket.
+
+Amazon Simple Storage Service (Amazon S3) provides secure, durable,
highly-scalable object storage. For information about Amazon S3, see [Amazon
S3](https://aws.amazon.com/s3/).
+
+You can define read-only external tables that use existing data files in the
S3 bucket for table data, or writable external tables that store the data from
`INSERT` operations to files in the S3 bucket. Apache Cloudberry uses the S3
URL and prefix specified in the protocol URL either to select one or more files
for a read-only table, or to define the location and filename format to use
when uploading S3 files for `INSERT` operations to writable tables.
+
+The `s3` protocol also supports [Dell Elastic Cloud
Storage](https://www.dell.com/en-us/dt/learn/data-storage/ecs.htm) (ECS), an
Amazon S3 compatible service.
+
+:::note
+The `pxf` protocol can access data in S3 and other object store systems such
as Azure, Google Cloud Storage, and Minio. The `pxf` protocol can also access
data in external Hadoop systems (HDFS, Hive, HBase), and SQL databases. See
[`pxf://` protocol](/docs/data-loading/load-data-using-pxf.md).
+:::
+
+## Configure the s3 protocol
+
+You must configure the `s3` protocol before you can use it. Perform these
steps in each database in which you want to use the protocol:
+
+1. Create the read and write functions for the `s3` protocol library:
+
+ ```sql
+ CREATE OR REPLACE FUNCTION write_to_s3() RETURNS integer AS
+ '$libdir/gps3ext.so', 's3_export' LANGUAGE C STABLE;
+ ```
+
+ ```sql
+ CREATE OR REPLACE FUNCTION read_from_s3() RETURNS integer AS
+ '$libdir/gps3ext.so', 's3_import' LANGUAGE C STABLE;
+ ```
+
+2. Declare the `s3` protocol and specify the read and write functions you
created in the previous step:
+
+ To allow only Apache Cloudberry superusers to use the protocol, create it
as follows:
+
+ ```sql
+ CREATE PROTOCOL s3 (writefunc = write_to_s3, readfunc = read_from_s3);
+ ```
+
+ If you want to permit non-superusers to use the `s3` protocol, create it
as a `TRUSTED` protocol and `GRANT` access to those users. For example:
+
+ ```sql
+ CREATE TRUSTED PROTOCOL s3 (writefunc = write_to_s3, readfunc =
read_from_s3);
+ GRANT ALL ON PROTOCOL s3 TO user1, user2;
+ ```
+
+ :::note
+ The protocol name `s3` must be the same as the protocol of the URL
specified for the external table that you create to access an S3 resource.
+ :::
+
+ The corresponding function is called by every Apache Cloudberry segment
instance.
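+
+To confirm that the protocol has been registered, you can query the protocol catalog. This is a quick check that assumes the Greenplum-derived `pg_extprotocol` system catalog is present in your build:
+
+```sql
+-- Lists the registered external protocols (assumed system catalog).
+SELECT * FROM pg_extprotocol;
+```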
+
+## Use s3 external tables
+
+Follow these basic steps to use the `s3` protocol with Apache Cloudberry
external tables. Each step includes links to relevant topics from which you can
obtain more information.
+
+1. [Configure the s3 Protocol](#configure-the-s3-protocol).
+2. Create the `s3` protocol configuration file:
+
+ 1. Create a template `s3` protocol configuration file using the
`gpcheckcloud` utility:
+
+ ```shell
+ gpcheckcloud -t > ./mytest_s3.config
+ ```
+
+ 2. (Optional) Edit the template file to specify the `accessid` and
`secret` authentication credentials required to connect to the S3 location.
+
+3. Apache Cloudberry can access an `s3` protocol configuration file when the
file is located on each segment host or when the file is served up by an
`http/https` server. Identify where you plan to locate the configuration file,
and note the location and configuration option (if applicable).
+
+ If you are relying on the AWS credential file to authenticate, this file
must reside at `~/.aws/credentials` on each Apache Cloudberry segment host.
+
+4. Use the `gpcheckcloud` utility to validate connectivity to the S3 bucket.
You must specify the S3 endpoint name and bucket that you want to check.
+
+ For example, if the `s3` protocol configuration file resides in the
default location, you would run the following command:
+
+ ```shell
+ gpcheckcloud -c "s3://<s3-endpoint>/<s3-bucket>"
+ ```
+
+ `gpcheckcloud` attempts to connect to the S3 endpoint and lists any files
in the S3 bucket, if available. A successful connection ends with the message:
+
+ ```
+ Your configuration works well.
+ ```
+
+ You can optionally use `gpcheckcloud` to validate uploading to and
downloading from the S3 bucket.
+
+5. Create an s3 external table by specifying an `s3` protocol URL in the
`CREATE EXTERNAL TABLE` command, `LOCATION` clause.
+
+### Create a readable S3 external table
+
+For read-only s3 tables, the URL defines the location and prefix used to select the existing data files that comprise the s3 table. Specify the S3 location and file prefix in the `LOCATION` clause:
+
+```sql
+CREATE EXTERNAL TABLE sales_data (
+    transaction_id int,
+    product_name text,
+    sale_date date,
+    amount decimal(10,2)
+)
+LOCATION ('s3://s3-us-west-2.amazonaws.com/your-bucket/sales-data/ config=/path/to/s3.config')
+FORMAT 'CSV' (HEADER);
+```
+
+Another example, which reads the files under the `dataset1/normal/` prefix:
+
+```sql
+CREATE READABLE EXTERNAL TABLE S3TBL (date text, time text, amt int)
+LOCATION('s3://s3-us-west-2.amazonaws.com/s3test.example.com/dataset1/normal/ config=/home/gpadmin/aws_s3/s3.conf')
+FORMAT 'csv';
+```
+
+### Create a writable S3 external table
+
+For writable s3 tables, the protocol URL defines the S3 location in which Apache Cloudberry writes the data files that back the table for `INSERT` operations. You can also specify a prefix that Apache Cloudberry adds to the files that it creates. For example:
+
+```sql
+CREATE WRITABLE EXTERNAL TABLE sales_export (LIKE sales_data)
+LOCATION ('s3://s3-us-west-2.amazonaws.com/your-bucket/exports/ config=/path/to/s3.config')
+FORMAT 'CSV'
+DISTRIBUTED BY (transaction_id);
+```
+
+Another example, based on the `S3TBL` table defined above:
+
+```sql
+CREATE WRITABLE EXTERNAL TABLE S3WRIT (LIKE S3TBL)
+LOCATION('s3://s3-us-west-2.amazonaws.com/s3test.example.com/dataset1/normal/ config=/home/gpadmin/aws_s3/s3.conf')
+FORMAT 'csv';
+```
+
+## About the s3 protocol LOCATION URL
+
+When you use the `s3` protocol, you specify an S3 file location and optional
configuration file location and region parameters in the `LOCATION` clause of
the `CREATE EXTERNAL TABLE` command. The syntax follows:
+
+```
+'s3://<S3_endpoint>[:<port>]/<bucket_name>/[<S3_prefix>] [region=<S3_region>]
[config=<config_file_location> | config_server=<url>] [section=<section_name>]'
+```
+
+The `s3` protocol requires that you specify the S3 endpoint and S3 bucket
name. Each Apache Cloudberry segment host must have access to the S3 location.
The optional S3_prefix value is used to select files for read-only S3 tables,
or as a filename prefix to use when uploading files for s3 writable tables.
+
+:::note
+The Apache Cloudberry `s3` protocol URL must include the S3 endpoint hostname.
+:::
+
+To specify an ECS endpoint (an Amazon S3 compatible service) in the `LOCATION`
clause, you must set the `s3` protocol configuration file parameter `version`
to `2`. The `version` parameter controls whether the `region` parameter is used
in the `LOCATION` clause. You can also specify an Amazon S3 location when the
`version` parameter is 2.
+
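+The following sketch shows the `region` parameter in a `LOCATION` clause. The bucket, prefix, and configuration file path are hypothetical, and the referenced configuration file is assumed to set `version` to `2`:
+
+```sql
+-- Hypothetical example: an Amazon S3 location addressed with the region parameter.
+CREATE EXTERNAL TABLE s3_region_example (id int, payload text)
+LOCATION ('s3://s3.amazonaws.com/your-bucket/data/ region=us-west-2 config=/home/gpadmin/aws_s3/s3.conf')
+FORMAT 'CSV';
+```
+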
+:::note
+Although the S3_prefix is an optional part of the syntax, you should always
include an S3 prefix for both writable and read-only s3 tables to separate
datasets as part of the `CREATE EXTERNAL TABLE` syntax.
+:::
+
+For writable s3 tables, the `s3` protocol URL specifies the endpoint and
bucket name where Apache Cloudberry uploads data files for the table. The S3
file prefix is used for each new file uploaded to the S3 location as a result
of inserting data to the table.
+
+For read-only s3 tables, the S3 file prefix is optional. If you specify an
S3_prefix, then the `s3` protocol selects all files that start with the
specified prefix as data files for the external table. The `s3` protocol does
not use the slash character (`/`) as a delimiter, so a slash character
following a prefix is treated as part of the prefix itself.
+
+For example, consider the following 5 files that each have the S3_endpoint
named `s3-us-west-2.amazonaws.com` and the bucket_name `test1`:
+
+```
+s3://s3-us-west-2.amazonaws.com/test1/abc
+s3://s3-us-west-2.amazonaws.com/test1/abc/
+s3://s3-us-west-2.amazonaws.com/test1/abc/xx
+s3://s3-us-west-2.amazonaws.com/test1/abcdef
+s3://s3-us-west-2.amazonaws.com/test1/abcdefff
+```
+
+- If the S3 URL is provided as `s3://s3-us-west-2.amazonaws.com/test1/abc`,
then the `abc` prefix selects all 5 files.
+- If the S3 URL is provided as `s3://s3-us-west-2.amazonaws.com/test1/abc/`,
then the `abc/` prefix selects the files
`s3://s3-us-west-2.amazonaws.com/test1/abc/` and
`s3://s3-us-west-2.amazonaws.com/test1/abc/xx`.
+- If the S3 URL is provided as `s3://s3-us-west-2.amazonaws.com/test1/abcd`,
then the `abcd` prefix selects the files
`s3://s3-us-west-2.amazonaws.com/test1/abcdef` and
`s3://s3-us-west-2.amazonaws.com/test1/abcdefff`
+
+Wildcard characters are not supported in an S3_prefix; however, the S3 prefix
functions as if a wildcard character immediately followed the prefix itself.
+
+All of the files selected by the S3 URL (S3_endpoint/bucket_name/S3_prefix)
are used as the source for the external table, so they must have the same
format. Each file must also contain complete data rows. A data row cannot be
split between files.
+
+You use the `config` or `config_server` parameter to specify the location of
the required `s3` protocol configuration file that contains AWS connection
credentials and communication parameters.
+
+Use the `section` parameter to specify the name of the configuration file
section from which the `s3` protocol reads configuration parameters. The
default `section` is named `default`. When you specify the section name in the
configuration file, enclose it in brackets (for example, `[default]`).
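+
+The following sketch reads settings from a non-default section. The section name and file path are hypothetical, and a `[sales]` section is assumed to exist in the referenced configuration file:
+
+```sql
+-- Hypothetical example: read s3 protocol settings from the [sales] section.
+CREATE EXTERNAL TABLE s3_sales_section (id int, payload text)
+LOCATION ('s3://s3-us-west-2.amazonaws.com/your-bucket/sales/ config=/home/gpadmin/aws_s3/s3.conf section=sales')
+FORMAT 'CSV';
+```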
+
+## Learn more
+
+- [Amazon S3 Documentation](https://aws.amazon.com/documentation/s3/)
+- [AWS S3
Endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region)
diff --git a/docs/data-loading/load-data-using-gpfdists.md
b/docs/data-loading/load-data-using-gpfdists.md
new file mode 100644
index 0000000000..4df834f474
--- /dev/null
+++ b/docs/data-loading/load-data-using-gpfdists.md
@@ -0,0 +1,262 @@
+---
+title: Load Data Using gpfdists (Secure)
+---
+
+# Load Data Securely Using `gpfdists`
+
+The `gpfdists` protocol is a secure version of the `gpfdist` protocol that
enables encrypted communication between Apache Cloudberry and the gpfdist file
server. When you use `gpfdists`, all data transfer is encrypted using SSL,
protecting against eavesdropping and man-in-the-middle attacks.
+
+`gpfdists` provides the same high-performance parallel data loading
capabilities as `gpfdist`, but with additional security features essential for
production environments handling sensitive data.
+
+## Security features
+
+- All data transmitted between Apache Cloudberry segments and gpfdist servers
is encrypted using SSL/TLS protocols, protecting against eavesdropping and data
interception.
+- Mutual authentication is enforced through client certificates, ensuring that
both Apache Cloudberry and gpfdist servers verify each other's identities
before establishing connections.
+- The implementation uses TLSv1 protocol with AES_128_CBC_SHA encryption
algorithm to provide strong cryptographic protection for data in transit.
+- Secure server identification mechanisms prevent unauthorized systems from
masquerading as legitimate gpfdist servers, protecting against
man-in-the-middle attacks.
+
+## Before you begin
+
+To use `gpfdists`, make sure that:
+
+- SSL certificates are configured on all segment hosts.
+- The gpfdist utility is available on the file server host.
+- Network connectivity exists between the segment hosts and the gpfdist server.
+- The appropriate SSL certificate files are in the correct locations.
+
+## Step 1. Set up SSL certificates
+
+### Required certificate files
+
+The following certificate files must be present in the `$PGDATA/gpfdists`
directory on each Apache Cloudberry segment host:
+
+#### For full SSL authentication (recommended)
+
+- `client.key` - Client private key file
+- `client.crt` - Client certificate file
+- `root.crt` - Trusted certificate authorities file
+
+#### Certificate requirements by configuration
+
+| verify_gpfdists_cert | --ssl_verify_peer | Required Certificate Files |
+|---------------------|-------------------|---------------------------|
+| on (default) | on (default) | `client.key`, `client.crt`, `root.crt` |
+| on | off | `root.crt` |
+| off | on | `client.key`, `client.crt` |
+| off | off | None |
+
+### Install certificates
+
+1. Create the gpfdists directory on each segment host:
+
+ ```shell
+ mkdir -p $PGDATA/gpfdists
+ ```
+
+2. Copy the certificate files to each segment host:
+
+ ```shell
+ # Copy to all segment hosts
+ scp client.key client.crt root.crt gpadmin@segment-host:$PGDATA/gpfdists/
+ ```
+
+3. Set appropriate permissions:
+
+ ```shell
+ chmod 600 $PGDATA/gpfdists/client.key
+ chmod 644 $PGDATA/gpfdists/client.crt
+ chmod 644 $PGDATA/gpfdists/root.crt
+ ```
+
+## Step 2. Start gpfdist with SSL
+
+Start the gpfdist utility with the `--ssl` option to enable secure connections:
+
+```shell
+gpfdist -p 8081 -d /data/load_files --ssl /path/to/certificates &
+```
+
+### SSL options for gpfdist
+
+- `--ssl <certificates_path>`: Enables SSL and specifies the certificate directory
+- `--ssl_verify_peer on|off`: Controls peer verification (default: on)
+
+### Example: Start multiple secure gpfdist instances
+
+```shell
+# Starts the first secure gpfdist instance.
+gpfdist -d /var/load_files1 -p 8081 --ssl /home/gpadmin/certs \
+ --ssl_verify_peer on -l /home/gpadmin/log1 &
+
+# Starts the second secure gpfdist instance.
+gpfdist -d /var/load_files2 -p 8082 --ssl /home/gpadmin/certs \
+ --ssl_verify_peer on -l /home/gpadmin/log2 &
+```
+
+## Step 3. Create external tables with gpfdists
+
+Use the `gpfdists://` protocol in the `LOCATION` clause to create secure
external tables:
+
+### Readable external table
+
+```sql
+CREATE EXTERNAL TABLE secure_sales_data (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('gpfdists://etl-server1:8081/sales/*.txt',
+ 'gpfdists://etl-server2:8082/sales/*.txt')
+FORMAT 'TEXT' (DELIMITER '|' NULL ' ');
+```
+
+### Writable external table
+
+```sql
+CREATE WRITABLE EXTERNAL TABLE secure_export (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('gpfdists://etl-server1:8081/exports/sales_data.txt')
+FORMAT 'TEXT' (DELIMITER '|')
+DISTRIBUTED BY (transaction_id);
+```
+
+### With error handling
+
+```sql
+CREATE EXTERNAL TABLE secure_data_with_errors (
+ id int,
+ name text,
+ value decimal(10,2)
+)
+LOCATION ('gpfdists://etl-server:8081/data/*.csv')
+FORMAT 'CSV' (HEADER)
+LOG ERRORS SEGMENT REJECT LIMIT 100;
+```
+
+## Configuration parameters
+
+### Apache Cloudberry parameters
+
+Configure these parameters in `postgresql.conf`:
+
+```ini
+# Enables/disables SSL certificate verification (default: on)
+verify_gpfdists_cert = on
+
+# Control segment parallelism
+gp_external_max_segs = 64
+```
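+
+To confirm the values currently in effect, check them from a `psql` session:
+
+```sql
+-- Shows the active settings for the current session.
+SHOW verify_gpfdists_cert;
+SHOW gp_external_max_segs;
+```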
+
+### gpfdist SSL parameters
+
+The gpfdist utility supports these SSL-related options:
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `--ssl` | Enable SSL and specify certificate path | Disabled |
+| `--ssl_verify_peer` | Verify client certificates | on |
+
+## Security best practices
+
+### Certificate management
+
+- Generate all certificates from a trusted certificate authority to ensure
proper validation and trust chain establishment.
+- Implement a regular certificate rotation schedule to enhance security and
prevent issues from certificate expiration.
+- Store private keys in secure locations with restricted access permissions,
ensuring only authorized personnel can access them.
+- Maintain secure backup copies of all certificate files to enable quick
recovery in case of system failures or corruption.
+
+### Network security
+
+- Configure firewall rules to restrict access to gpfdists ports, allowing only
authorized Apache Cloudberry segment hosts to connect.
+- Use secure network connections such as VPNs or private networks to prevent
unauthorized access to data transmission channels.
+- Implement continuous monitoring of SSL connections and certificate
expiration dates to proactively address security issues.
+
+### Access control
+
+- Apply the principle of least privilege by granting only the minimum
permissions necessary for users and applications to perform their required
functions.
+- Implement robust authentication mechanisms including multi-factor
authentication where appropriate to verify user identities.
+- Enable comprehensive audit logging to track all access attempts, successful
connections, and security-related events for compliance and security monitoring.
+
+## Troubleshooting
+
+### SSL connection errors
+
+Check certificate configuration:
+
+```shell
+# Verifies certificate files exist.
+ls -la $PGDATA/gpfdists/
+
+# Checks certificate validity.
+openssl x509 -in $PGDATA/gpfdists/client.crt -text -noout
+```
+
+### Troubleshoot certificate verification issues
+
+1. Verify that the `root.crt` file contains the correct certificate authority
chain and that all intermediate certificates are properly included for
validation.
+2. Check certificate expiration dates using tools like `openssl x509 -dates`
to ensure that certificates have not expired and plan for renewal well in
advance.
+3. Validate that the private key file `client.key` corresponds exactly to the
public certificate in `client.crt` using certificate validation tools.
+
+### Common error messages
+
+| Error | Cause | Solution |
+|-------|-------|----------|
+| "SSL certificate verify failed" | Invalid or expired certificate | Check
certificate validity and CA |
+| "SSL handshake failed" | SSL configuration mismatch | Verify SSL settings on
both sides |
+| "Permission denied" | Incorrect file permissions | Set proper permissions on
certificate files |
+
+### Debug SSL connections
+
+Enable verbose logging:
+
+```shell
+gpfdist -d /data -p 8081 --ssl /certs -V --ssl_verify_peer on
+```
+
+## Performance considerations
+
+- SSL encryption introduces computational overhead during data transmission,
which may reduce overall throughput compared to unencrypted connections,
especially for large data transfers.
+- Certificate caching mechanisms help reduce the performance impact of SSL
handshakes by reusing established secure connections across multiple data
transfer operations.
+- Deploy multiple gpfdists instances across different hosts or network
interfaces to distribute the load and achieve better aggregate throughput for
concurrent data operations.
+- Ensure that your network infrastructure provides adequate bandwidth and low
latency between Apache Cloudberry segments and gpfdist servers to minimize
performance bottlenecks.
+
+## Migration from gpfdist to gpfdists
+
+To migrate existing gpfdist usage to gpfdists:
+
+- Install and configure the required SSL certificates on all Apache Cloudberry
segment hosts, ensuring proper permissions and certificate chain validation.
+- Update all external table definitions to change protocol specifications from
`gpfdist://` to `gpfdists://` in their LOCATION clauses.
+- Restart your gpfdist file servers with the `--ssl` option enabled,
specifying the appropriate certificate directories and SSL verification
settings.
+- Thoroughly test secure connections from all Apache Cloudberry segments to
verify that data loading operations work correctly with the new encrypted
protocol.
+- Update any automation scripts, monitoring tools, and operational procedures
to account for the new secure protocol requirements and certificate management
tasks.
+
+### Example migration
+
+Before (insecure):
+
+```sql
+LOCATION ('gpfdist://etl-server:8081/data.txt')
+```
+
+After (secure):
+
+```sql
+LOCATION ('gpfdists://etl-server:8081/data.txt')
+```
+
+## Limitations
+
+- Cannot mix `gpfdist://` and `gpfdists://` protocols in the same external
table definition.
+- All Apache Cloudberry segment hosts must have properly configured SSL
certificates and trust relationships.
+- SSL encryption introduces computational overhead that may reduce data
transfer throughput compared to unencrypted gpfdist.
+- Ongoing certificate lifecycle management is required, including renewal,
rotation, and revocation processes.
+
+## Learn more
+
+- [Load Data Using gpfdist](/docs/data-loading/load-data-using-gpfdist.md)
+- [SSL Certificate Management](https://www.sslshopper.com/ssl-converter.html)
diff --git a/docs/data-loading/load-data-using-pxf.md
b/docs/data-loading/load-data-using-pxf.md
new file mode 100644
index 0000000000..dbac9b7b42
--- /dev/null
+++ b/docs/data-loading/load-data-using-pxf.md
@@ -0,0 +1,327 @@
+---
+title: Load Data Using PXF
+---
+
+# Load Data Using the Platform Extension Framework (PXF)
+
+The Apache Cloudberry Platform Extension Framework (PXF) is an extensible
framework that allows Apache Cloudberry to query external data files whose
metadata is not managed by the database. You can use its `pxf://` protocol to
access data residing in object store systems (Azure, Google Cloud Storage,
Minio, S3), external Hadoop systems (HDFS, Hive, HBase), and SQL databases.
+
+PXF includes built-in connectors for accessing data that exists inside HDFS
files, Hive tables, HBase tables, and JDBC-accessible databases. Users can also
create their own connectors to other data storage or processing engines.
+
+The version of PXF used in Apache Cloudberry is forked from the Greenplum PXF
project and has been specifically adapted for it. The source code for this
adapted version is hosted by the Apache Software Foundation at
[apache/cloudberry-pxf](https://github.com/apache/cloudberry-pxf).
+
+The PXF `pxf` protocol is packaged as an Apache Cloudberry extension that
supports both reading from and writing to external data stores. The framework
includes a C-language extension and a Java service. After you configure and
initialize PXF, you start a single PXF JVM process on each Apache Cloudberry
segment host. This long-running process concurrently serves multiple query
requests.
+
+Before using the `pxf` protocol, you must explicitly initialize and start the
PXF service. You must also enable PXF in each database where it will be used
and grant permissions on the `pxf` protocol to the relevant Apache Cloudberry
users.
+
+## Install and Build PXF
+
+PXF is a component that must be built and installed separately. The source
code and detailed instructions for building, installing, and developing PXF are
available in the official `apache/cloudberry-pxf` repository. Before proceeding
with the configuration, follow the development guide in the repository to set
up PXF.
+
+**Build and Installation Guide**: [apache/cloudberry-pxf
README](https://github.com/apache/cloudberry-pxf/blob/main/README.md)
+
+## Configure PXF
+
+Before you can use PXF to access external data, you must configure and
initialize the service.
+
+### Initialize and Start PXF
+
+To use PXF, you must first initialize it. This process creates the necessary
configuration directory structure and files on each Apache Cloudberry segment
host. After initialization, you need to start the PXF service.
+
+1. Add the PXF binary directory to your `PATH`. This step ensures that you
can run `pxf` commands from any location.
+
+ ```shell
+ export PATH=/usr/local/pxf/bin:$PATH
+ ```
+
+ You can add this line to your `.bashrc` or `.zshrc` file to make it
permanent.
+
+2. Initialize PXF. The `prepare` command sets up the PXF configuration. This
command only needs to be run once.
+
+ ```shell
+ pxf prepare
+ ```
+
+3. Start the PXF service. This command starts a Java process on each segment
host that acts as the external data coordinator.
+
+ ```shell
+ pxf start
+ ```
+
+### Enable PXF in a database
+
+You must enable PXF in each database in which you want to use the `pxf`
protocol to access external data.
+
+To enable PXF in a database, you must create the PXF extension in the database:
+
+```sql
+CREATE EXTENSION pxf;
+```
+
+The `pxf` protocol is packaged as an extension with Apache Cloudberry, and you
must explicitly enable it in each database in which you plan to use it.
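+
+To verify that the extension is enabled in the current database, query the standard extension catalog:
+
+```sql
+-- Confirms that the pxf extension is installed in the current database.
+SELECT extname, extversion FROM pg_extension WHERE extname = 'pxf';
+```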
+
+### Grant access to PXF
+
+To allow non-superuser Apache Cloudberry roles to create external tables using
the `pxf` protocol, you must grant `SELECT` privileges on the `pxf` protocol to
each role:
+
+```sql
+GRANT SELECT ON PROTOCOL pxf TO <role_name>;
+```
+
+To allow non-superuser Apache Cloudberry roles to create writable external
tables using the `pxf` protocol, you must grant `INSERT` privileges on the
`pxf` protocol to each role:
+
+```sql
+GRANT INSERT ON PROTOCOL pxf TO <role_name>;
+```
+
+## PXF configuration overview
+
+PXF configuration includes server configuration and connector configuration. A
PXF *server* is a named configuration that provides access credentials and
other information required to access an external data source. A PXF *connector*
is the interface between PXF and the external data source.
+
+### Example: configure S3 connector
+
+Create a PXF configuration for accessing S3 data:
+
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<configuration>
+ <property>
+ <name>fs.s3a.access.key</name>
+ <value>YOUR_ACCESS_KEY</value>
+ </property>
+ <property>
+ <name>fs.s3a.secret.key</name>
+ <value>YOUR_SECRET_KEY</value>
+ </property>
+ <property>
+ <name>fs.s3a.endpoint</name>
+ <value>s3.amazonaws.com</value>
+ </property>
+</configuration>
+```
+
+### Example: configure JDBC connector
+
+For accessing SQL databases:
+
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<configuration>
+ <property>
+ <name>jdbc.driver</name>
+ <value>org.postgresql.Driver</value>
+ </property>
+ <property>
+ <name>jdbc.url</name>
+ <value>jdbc:postgresql://hostname:5432/database</value>
+ </property>
+ <property>
+ <name>jdbc.user</name>
+ <value>username</value>
+ </property>
+ <property>
+ <name>jdbc.password</name>
+ <value>password</value>
+ </property>
+</configuration>
+```
+
+## Create PXF external tables
+
+PXF external tables use the `pxf://` protocol in the `LOCATION` clause. The
URL format varies depending on the data source and connector.
+
+### Read from HDFS
+
+Access text files stored in HDFS:
+
+```sql
+CREATE EXTERNAL TABLE sales_hdfs (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('pxf://hdfs-namenode:8020/data/sales/sales.txt?PROFILE=hdfs:text')
+FORMAT 'TEXT' (DELIMITER '|');
+```
+
+### Read from Hive tables
+
+Access Hive tables directly:
+
+```sql
+CREATE EXTERNAL TABLE hive_sales (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('pxf://hive-metastore:9083/sales_db.sales_table?PROFILE=hive')
+FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+```
+
+### Read from Parquet files
+
+Access Parquet files in object storage:
+
+```sql
+CREATE EXTERNAL TABLE parquet_data (
+ id bigint,
+ name text,
+ created_date date
+)
+LOCATION
('pxf://s3a://my-bucket/data/events.parquet?PROFILE=s3:parquet&SERVER=s3-server')
+FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+```
+
+### Read from SQL databases
+
+Query external SQL databases:
+
+```sql
+CREATE EXTERNAL TABLE external_customers (
+ customer_id int,
+ customer_name text,
+ email text,
+ registration_date date
+)
+LOCATION
('pxf://postgresql-server/customers?PROFILE=jdbc&SERVER=postgres-server')
+FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+```
+
+### Write data with PXF
+
+Create writable external tables to export data:
+
+```sql
+CREATE WRITABLE EXTERNAL TABLE export_to_hdfs (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('pxf://hdfs-namenode:8020/exports/sales_export?PROFILE=hdfs:text')
+FORMAT 'TEXT' (DELIMITER '|')
+DISTRIBUTED BY (transaction_id);
+```
+
+## PXF URL format
+
+The PXF protocol URL follows this syntax:
+
+```
+pxf://<host>[:<port>]/<path-to-data>?PROFILE=<profile_name>[&<custom-option>=<value>][&SERVER=<server_name>]
+```
+
+Where:
+- `host:port`: Location of the external data source
+- `path-to-data`: Path to the specific data (file, directory, table, etc.)
+- `PROFILE`: PXF connector profile (e.g., `hdfs:text`, `hive`, `s3:parquet`,
`jdbc`)
+- `SERVER`: Named server configuration (optional)
+- `custom-option`: Additional connector-specific options
+
+## Common PXF profiles
+
+| Profile | Data Source | Format | Use Case |
+|---------|-------------|--------|----------|
+| `hdfs:text` | HDFS | Text files | Delimited text data |
+| `hdfs:avro` | HDFS | Avro files | Schema evolution support |
+| `hdfs:parquet` | HDFS | Parquet files | Columnar analytics |
+| `hdfs:orc` | HDFS | ORC files | Optimized row columnar |
+| `hive` | Hive | Various | Hive table access |
+| `hbase` | HBase | HBase | NoSQL data access |
+| `s3:text` | S3 | Text files | Cloud object storage |
+| `s3:parquet` | S3 | Parquet files | Cloud analytics |
+| `jdbc` | SQL Database | Various | External database access |
+
+## Performance optimization
+
+### Partition pruning
+
+PXF supports partition pruning for Hive tables:
+
+```sql
+SELECT * FROM hive_sales
+WHERE sale_date >= '2024-01-01' AND sale_date < '2024-02-01';
+```
+
+### Predicate pushdown
+
+Enable predicate pushdown for better performance:
+
+```sql
+-- This filter can be pushed down to the external source
+SELECT * FROM external_customers
+WHERE registration_date > '2024-01-01';
+```
+
+### Parallel processing
+
+Leverage multiple files for better parallelism:
+
+```sql
+CREATE EXTERNAL TABLE multi_file_data (
+ id bigint,
+ data text
+)
+LOCATION ('pxf://hdfs-namenode:8020/data/partitioned/*?PROFILE=hdfs:text')
+FORMAT 'TEXT' (DELIMITER ',');
+```
+
+## Error handling
+
+Enable error logging for data quality monitoring:
+
+```sql
+CREATE EXTERNAL TABLE sales_with_errors (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('pxf://hdfs-namenode:8020/data/sales/?PROFILE=hdfs:text')
+FORMAT 'TEXT' (DELIMITER '|')
+LOG ERRORS SEGMENT REJECT LIMIT 50;
+```
+
+## Best practices
+
+- Select the most specific PXF profile that matches your data format and
storage system to ensure optimal performance and feature support.
+- Create reusable named server configurations to centralize connection
parameters and credentials, making it easier to manage multiple external data
sources.
+- Structure your external data using file sizes between 100MB and 1GB to
achieve optimal parallel processing across Apache Cloudberry segments.
+- Leverage compressed storage formats such as Parquet or ORC to reduce I/O
overhead and improve query performance.
+- Implement comprehensive error logging and monitoring in production
environments to quickly identify and resolve data access issues.
+- Establish proper authentication mechanisms and enable encryption for all
connections to sensitive external data sources.
+
+## Limitations
+
+- PXF requires the external data source to be accessible from all segment
hosts.
+- Some data sources may have specific version compatibility requirements.
+- Write operations depend on the connector's capabilities.
+- Performance can be affected by network latency to external sources.
+
+## Troubleshooting
+
+### Check PXF service status
+
+```shell
+pxf cluster status
+```
+
+### View PXF logs
+
+```shell
+pxf cluster logs
+```
+
+### Test connectivity
+
+```sql
+SELECT * FROM pxf_external_table LIMIT 5;
+```
+
+## Learn more
+
+For more details about the Apache Cloudberry Platform Extension Framework
(PXF), please refer to the official PXF project repository, which contains the
source code, documentation, and contribution guidelines.
+
+- [Apache Cloudberry PXF Project](https://github.com/apache/cloudberry-pxf)
diff --git a/docs/developer/write-a-foreign-data-wrapper.md
b/docs/developer/write-a-foreign-data-wrapper.md
new file mode 100644
index 0000000000..92140fd353
--- /dev/null
+++ b/docs/developer/write-a-foreign-data-wrapper.md
@@ -0,0 +1,411 @@
+---
+title: Write a Foreign Data Wrapper
+---
+
+# Write a Foreign Data Wrapper
+
+This guide outlines how to write a new foreign data wrapper (FDW) for Apache
Cloudberry. A foreign data wrapper is a library that consists of a set of
functions that the Apache Cloudberry server calls to access external data
sources. The FDW is responsible for fetching data from remote data stores and
returning it to the Apache Cloudberry executor.
+
+FDWs enable Apache Cloudberry to treat external data sources as if they were
regular database tables, allowing you to query external data using standard SQL.
+
+## Before you begin
+
+### Requirements
+
+When developing with the Apache Cloudberry foreign data wrapper API:
+
+- Your development system must have the same hardware and software
architecture as your Apache Cloudberry hosts
+- Code must be written in a compiled language such as C, using the version-1
interface
+- Symbol names in your object files must not conflict with each other or with
symbols defined in the Apache Cloudberry server
+- You should be familiar with the foreign table concepts described in
[Accessing External Data with Foreign Tables](/docs/external/g-foreign.html)
+
+### Header files
+
+The Apache Cloudberry header files for FDW development are located in
`$GPHOME/include/postgresql/server/`:
+
+- `foreign/fdwapi.h` - FDW API structures and callback function signatures
+- `foreign/foreign.h` - Foreign data wrapper helper structs and functions
+- `catalog/pg_foreign_table.h` - Foreign table definition
+- `catalog/pg_foreign_server.h` - Foreign server definition
+
+## FDW architecture
+
+An FDW consists of two main components:
+
+1. A handler function that returns a struct containing function pointers to
all the callback functions needed by Apache Cloudberry to interact with the
external data source.
+2. An optional validator function that validates configuration options
provided in `CREATE` and `ALTER` commands for the foreign data wrapper,
servers, user mappings, and foreign tables.
+
+### Handler function
+
+The handler function is SQL-invokable and returns a struct containing pointers
to callback functions:
+
+```sql
+CREATE FUNCTION my_fdw_handler()
+ RETURNS fdw_handler
+ AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+```
+
+### Validator function
+
+The validator function validates options for FDW objects:
+
+```sql
+CREATE FUNCTION my_fdw_validator(text[], oid)
+ RETURNS void
+ AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+```
+
+## Callback functions
+
+The FDW API defines callback functions that Apache Cloudberry invokes when
scanning and updating foreign tables. The handler function returns a
`FdwRoutine` struct containing pointers to these functions.
+
+### Required scan-related callbacks
+
+These functions are required for all FDWs:
+
+#### GetForeignRelSize
+
+```c
+void GetForeignRelSize(PlannerInfo *root,
+ RelOptInfo *baserel,
+ Oid foreigntableid)
+```
+
+Obtain relation size estimates for a foreign table. Called at the beginning of
planning.
+
+#### GetForeignPaths
+
+```c
+void GetForeignPaths(PlannerInfo *root,
+ RelOptInfo *baserel,
+ Oid foreigntableid)
+```
+
+Create possible access paths for a scan on a foreign table. Must call
`create_foreignscan_path()` for Apache Cloudberry compatibility.
+
+#### GetForeignPlan
+
+```c
+ForeignScan *GetForeignPlan(PlannerInfo *root,
+ RelOptInfo *baserel,
+ Oid foreigntableid,
+ ForeignPath *best_path,
+ List *tlist,
+ List *scan_clauses)
+```
+
+Create a ForeignScan plan node from the selected foreign access path.
+
+#### BeginForeignScan
+
+```c
+void BeginForeignScan(ForeignScanState *node,
+ int eflags)
+```
+
+Begin executing a foreign scan. Called during executor startup.
+
+#### IterateForeignScan
+
+```c
+TupleTableSlot *IterateForeignScan(ForeignScanState *node)
+```
+
+Fetch one row from the foreign source. Return NULL when no more rows are
available.
+
+#### ReScanForeignScan
+
+```c
+void ReScanForeignScan(ForeignScanState *node)
+```
+
+Restart the scan from the beginning.
+
+#### EndForeignScan
+
+```c
+void EndForeignScan(ForeignScanState *node)
+```
+
+End the scan and release resources.
+
+### Optional update-related callbacks
+
+If your FDW supports write operations, implement these functions:
+
+#### ExecForeignInsert
+
+```c
+TupleTableSlot *ExecForeignInsert(EState *estate,
+ ResultRelInfo *rinfo,
+ TupleTableSlot *slot,
+ TupleTableSlot *planSlot)
+```
+
+Insert a single tuple into the foreign table.
+
+#### ExecForeignUpdate
+
+```c
+TupleTableSlot *ExecForeignUpdate(EState *estate,
+ ResultRelInfo *rinfo,
+ TupleTableSlot *slot,
+ TupleTableSlot *planSlot)
+```
+
+Update a single tuple in the foreign table.
+
+#### ExecForeignDelete
+
+```c
+TupleTableSlot *ExecForeignDelete(EState *estate,
+ ResultRelInfo *rinfo,
+ TupleTableSlot *slot,
+ TupleTableSlot *planSlot)
+```
+
+Delete a single tuple from the foreign table.
+
+## Apache Cloudberry considerations
+
+### The mpp_execute option
+
+Apache Cloudberry supports parallel execution through the `mpp_execute`
option. Your FDW should handle this option to determine where to request or
send data:
+
+```c
+ForeignTable *table = GetForeignTable(foreigntableid);
+if (table->exec_location == FTEXECLOCATION_ALL_SEGMENTS)
+{
+ // Execute on all segments in parallel
+}
+else if (table->exec_location == FTEXECLOCATION_ANY)
+{
+ // Execute on coordinator or any one segment
+}
+else if (table->exec_location == FTEXECLOCATION_COORDINATOR)
+{
+ // Execute on coordinator only (default)
+}
+```
+
+### Segment identification
+
+For parallel execution (`mpp_execute 'all segments'`), each segment must
determine which portion of the data is its responsibility:
+
+```c
+int segmentNumber = GpIdentity.segindex;
+int totalNumberOfSegments = getgpsegmentCount();
+```
+
+### Parallel write operations
+
+Apache Cloudberry supports parallel write operations only when `mpp_execute`
is set to `'all segments'`. For other settings, write operations are initiated
through the coordinator.
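+
+You can set the option when you create the foreign table, or add it afterwards. A minimal sketch, using the hypothetical `my_foreign_table` from the testing section below and assuming the wrapper's validator accepts the option:
+
+```sql
+-- Requests parallel execution on all segments for one foreign table.
+ALTER FOREIGN TABLE my_foreign_table OPTIONS (ADD mpp_execute 'all segments');
+```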
+
+## Example FDW implementation
+
+Here's a basic skeleton for an FDW:
+
+```c
+#include "postgres.h"
+#include "foreign/fdwapi.h"
+#include "foreign/foreign.h"
+
+PG_MODULE_MAGIC;
+
+// Function declarations
+extern Datum my_fdw_handler(PG_FUNCTION_ARGS);
+extern Datum my_fdw_validator(PG_FUNCTION_ARGS);
+
+// FDW callback functions
+static void myGetForeignRelSize(PlannerInfo *root,
+ RelOptInfo *baserel,
+ Oid foreigntableid);
+static void myGetForeignPaths(PlannerInfo *root,
+ RelOptInfo *baserel,
+ Oid foreigntableid);
+// ... other callback functions
+
+PG_FUNCTION_INFO_V1(my_fdw_handler);
+Datum
+my_fdw_handler(PG_FUNCTION_ARGS)
+{
+ FdwRoutine *fdwroutine = makeNode(FdwRoutine);
+
+ // Required scan functions
+ fdwroutine->GetForeignRelSize = myGetForeignRelSize;
+ fdwroutine->GetForeignPaths = myGetForeignPaths;
+ fdwroutine->GetForeignPlan = myGetForeignPlan;
+ fdwroutine->BeginForeignScan = myBeginForeignScan;
+ fdwroutine->IterateForeignScan = myIterateForeignScan;
+ fdwroutine->ReScanForeignScan = myReScanForeignScan;
+ fdwroutine->EndForeignScan = myEndForeignScan;
+
+ // Optional write functions
+ fdwroutine->ExecForeignInsert = myExecForeignInsert;
+ fdwroutine->ExecForeignUpdate = myExecForeignUpdate;
+ fdwroutine->ExecForeignDelete = myExecForeignDelete;
+
+ PG_RETURN_POINTER(fdwroutine);
+}
+
+PG_FUNCTION_INFO_V1(my_fdw_validator);
+Datum
+my_fdw_validator(PG_FUNCTION_ARGS)
+{
+ List *options_list = untransformRelOptions(PG_GETARG_DATUM(0));
+ Oid catalog = PG_GETARG_OID(1);
+
+ // Validate options based on catalog type
+ // Implementation depends on your FDW's specific options
+
+ PG_RETURN_VOID();
+}
+```
+
+## Building and packaging
+
+### Using PGXS
+
+Create a Makefile using the PostgreSQL build extension infrastructure:
+
+```makefile
+MODULE_big = my_fdw
+OBJS = my_fdw.o
+
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+
+PG_CPPFLAGS = -I$(shell $(PG_CONFIG) --includedir)
+SHLIB_LINK = -L$(shell $(PG_CONFIG) --libdir)
+include $(PGXS)
+```
+
+### Creating an extension
+
+Create SQL script file (`my_fdw--1.0.sql`):
+
+```sql
+CREATE FUNCTION my_fdw_handler()
+ RETURNS fdw_handler
+ AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION my_fdw_validator(text[], oid)
+ RETURNS void
+ AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+CREATE FOREIGN DATA WRAPPER my_fdw
+ HANDLER my_fdw_handler
+ VALIDATOR my_fdw_validator;
+```
+
+Create the control file (`my_fdw.control`):
+
+```ini
+# my_fdw extension
+comment = 'My custom foreign data wrapper'
+default_version = '1.0'
+module_pathname = '$libdir/my_fdw'
+relocatable = true
+```
+
+Update the Makefile for the extension:
+
+```makefile
+EXTENSION = my_fdw
+DATA = my_fdw--1.0.sql
+```
+
+## Deployment
+
+For Apache Cloudberry clusters:
+
+1. Install the shared library to `$GPHOME/lib/postgresql/` on all hosts.
+2. Install the extension files to `$GPHOME/share/postgresql/extension/` on all hosts.
+3. Set permissions so that the `gpadmin` user can access all files.
+4. Make sure the files are present on every host in the cluster, including the coordinator and all segment hosts, as sketched below.
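+
+A minimal sketch of the copy step, assuming a hypothetical `hostfile` that lists one cluster host name per line and that `$GPHOME` is the same path on every host:
+
+```shell
+# Copies the build artifacts to every host listed in the hypothetical hostfile.
+for host in $(cat hostfile); do
+    scp my_fdw.so gpadmin@${host}:$GPHOME/lib/postgresql/
+    scp my_fdw.control my_fdw--1.0.sql gpadmin@${host}:$GPHOME/share/postgresql/extension/
+done
+```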
+
+## Testing your FDW
+
+### Basic functionality test
+
+```sql
+-- Create extension
+CREATE EXTENSION my_fdw;
+
+-- Create foreign server
+CREATE SERVER my_server
+ FOREIGN DATA WRAPPER my_fdw
+ OPTIONS (host 'remote-host', port '5432');
+
+-- Create user mapping
+CREATE USER MAPPING FOR current_user
+ SERVER my_server
+ OPTIONS (user 'remote_user', password 'remote_password');
+
+-- Create foreign table
+CREATE FOREIGN TABLE my_foreign_table (
+ id integer,
+ name text
+) SERVER my_server
+OPTIONS (table_name 'remote_table');
+
+-- Test query
+SELECT * FROM my_foreign_table LIMIT 5;
+```
+
+### Parallel execution test
+
+```sql
+-- Test parallel execution
+CREATE FOREIGN TABLE parallel_test (
+ id integer,
+ data text
+) SERVER my_server
+OPTIONS (
+ table_name 'large_table',
+ mpp_execute 'all segments'
+);
+
+SELECT count(*) FROM parallel_test;
+```
+
+## Best practices
+
+- Implement comprehensive error handling throughout your FDW code, ensuring
that all failure modes are properly reported to users with clear, actionable
error messages.
+- Use PostgreSQL memory contexts correctly to prevent memory leaks and ensure
proper cleanup when queries are cancelled or encounter errors.
+- Design your FDW to handle bulk data operations efficiently and take
advantage of Apache Cloudberry's parallel processing capabilities where
appropriate.
+- Validate all user inputs and configuration parameters, implement secure
authentication mechanisms, and follow security best practices for external data
access.
+- Create thorough documentation that explains all configuration options,
connection parameters, and usage examples for administrators and developers.
+- Develop comprehensive test suites that cover various data types, edge cases,
error conditions, and performance scenarios across different external data
sources.
+
+## Debugging
+
+### Enable verbose logging
+
+Set the following in `postgresql.conf`:
+
+```ini
+log_min_messages = DEBUG1
+```
+
+### Use elog for debugging
+
+```c
+elog(DEBUG1, "FDW: Processing %d rows", row_count);
+elog(WARNING, "FDW: Connection failed, retrying...");
+```
+
+### Common issues
+
+1. Segmentation faults frequently occur due to improper memory management,
such as accessing freed memory or buffer overruns in C code.
+2. Symbol name conflicts can arise when multiple extensions define functions
with the same names; ensure all function names are unique and properly
namespaced.
+3. Parallel execution issues are often difficult to debug; start testing with
single-segment configurations before enabling multi-segment parallel processing.
+4. Performance bottlenecks may appear in data conversion or network
operations; use profiling tools to identify and optimize critical code paths.
+
+## Learn more
+
+- [Foreign Data Wrapper Helper
Functions](https://www.postgresql.org/docs/12/fdw-helpers.html)
+- [Foreign Data Wrapper Callback
Routines](https://www.postgresql.org/docs/12/fdw-callbacks.html)
diff --git a/sidebars.ts b/sidebars.ts
index 0c9376c198..2e0aabbab2 100644
--- a/sidebars.ts
+++ b/sidebars.ts
@@ -48,14 +48,31 @@ const sidebars: SidebarsConfig = {
link: {
type: "doc",
id: 'data-loading/index',
- },
- items: [
- {
- type: 'category',
- label: 'Load Data from Local Files',
- items: ['data-loading/load-data-using-copy',
'data-loading/load-data-using-gpfdist',
'data-loading/load-data-using-file-protocol','data-loading/load-data-using-gpload']
},
- 'data-loading/load-data-from-web-services',
'data-loading/load-data-from-kafka-using-fdw']
+ items: [
+ {
+ type: 'category',
+ label: 'From Local or Network Files',
+ items: [
+ 'data-loading/load-data-using-copy',
+ 'data-loading/load-data-using-gpfdist',
+ 'data-loading/load-data-using-gpfdists',
+ 'data-loading/load-data-using-file-protocol',
+ 'data-loading/load-data-using-gpload',
+ ],
+ },
+ {
+ type: 'category',
+ label: 'From Cloud or Big Data Systems',
+ items: [
+ 'data-loading/load-data-from-s3',
+ 'data-loading/load-data-using-pxf',
+ ],
+ },
+ 'data-loading/load-data-from-web-services',
+ 'data-loading/load-data-from-kafka-using-fdw',
+ 'data-loading/handle-data-errors',
+ ],
},
{
@@ -189,7 +206,7 @@ const sidebars: SidebarsConfig = {
{
type: 'category',
label: 'Developer',
- items: ['developer/develop-extensions-using-rust']
+ items: ['developer/develop-extensions-using-rust',
'developer/write-a-foreign-data-wrapper']
},
{
diff --git a/versioned_docs/version-2.x/data-loading/handle-data-errors.md
b/versioned_docs/version-2.x/data-loading/handle-data-errors.md
new file mode 100644
index 0000000000..1a0b357606
--- /dev/null
+++ b/versioned_docs/version-2.x/data-loading/handle-data-errors.md
@@ -0,0 +1,331 @@
+---
+title: Handle Data Loading Errors
+---
+
+# Handle Data Loading Errors
+
+Real-world data is often imperfect, containing formatting errors, missing
values, or inconsistent data types. Apache Cloudberry provides robust error
handling mechanisms that allow you to load correctly formatted data while
isolating and managing problematic rows, ensuring your ETL processes are
resilient and reliable.
+
+By default, if external table data contains any error, the entire load
operation fails and no data is loaded. With error handling enabled, you can
load valid data and deal with problematic rows separately.
+
+## Error handling modes
+
+Apache Cloudberry supports two error handling approaches:
+
+1. Single row error isolation allows the system to skip individual problematic
rows and continue processing the remaining valid data, preventing entire
operations from failing due to isolated data quality issues.
+2. Error logging functionality captures comprehensive details about
problematic data rows, including error descriptions, line numbers, and the
actual data that caused the failure, enabling thorough analysis and remediation.
+
+## Single row error isolation
+
+### Basic error isolation
+
+Enable single row error isolation by adding `SEGMENT REJECT LIMIT` to your
external table definition:
+
+```sql
+CREATE EXTERNAL TABLE sales_data_with_errors (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV' (HEADER)
+SEGMENT REJECT LIMIT 100;
+```
+
+This configuration allows up to 100 rows with errors per segment before the
operation fails.
+
+### Percentage-based limits
+
+You can also specify error limits as a percentage:
+
+```sql
+CREATE EXTERNAL TABLE sales_data (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV' (HEADER)
+SEGMENT REJECT LIMIT 5 PERCENT;
+```
+
+This allows up to 5% of rows to contain errors before failing.
+
+## Error logging
+
+### Enable error logging
+
+Use `LOG ERRORS` to capture detailed information about rejected rows:
+
+```sql
+CREATE EXTERNAL TABLE sales_data_logged (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV' (HEADER)
+LOG ERRORS
+SEGMENT REJECT LIMIT 50;
+```
+
+### Persistent error logging
+
+For long-term error analysis, use persistent error logging:
+
+```sql
+CREATE EXTERNAL TABLE sales_data_persistent (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV' (HEADER)
+LOG ERRORS PERSISTENTLY
+SEGMENT REJECT LIMIT 25;
+```
+
+## View error information
+
+### Query error logs
+
+When error logging is enabled, Apache Cloudberry creates error log tables that
you can query:
+
+```sql
+-- View recent errors from the current session
+SELECT * FROM gp_read_error_log('sales_data_logged');
+```
+
+### Error log table structure
+
+The error log contains these columns:
+
+| Column | Description |
+|--------|-------------|
+| `cmdtime` | Timestamp when the error occurred |
+| `relname` | Name of the external table |
+| `filename` | Source file containing the error |
+| `linenum` | Line number in the source file |
+| `bytenum` | Byte position in the source file |
+| `errmsg` | Error message description |
+| `rawdata` | Raw data that caused the error |
+| `rawbytes` | Raw bytes of the problematic data |
+
+### Example error analysis
+
+```sql
+-- Find the most common error types
+SELECT errmsg, COUNT(*) as error_count
+FROM gp_read_error_log('sales_data_logged')
+GROUP BY errmsg
+ORDER BY error_count DESC;
+
+-- View specific error details
+SELECT cmdtime, filename, linenum, errmsg, rawdata
+FROM gp_read_error_log('sales_data_logged')
+WHERE errmsg LIKE '%invalid input syntax%'
+ORDER BY cmdtime DESC;
+```
+
+## Common data errors and solutions
+
+### Data type conversion errors
+
+**Error**: `invalid input syntax for type numeric`
+**Cause**: Non-numeric data in numeric columns
+**Solution**: Clean data or use text columns with post-processing
+
+```sql
+-- Original problematic data: "N/A" in amount column
+-- Solution: Use text type and handle conversion later
+CREATE EXTERNAL TABLE sales_flexible (
+ transaction_id int,
+ product_name text,
+ sale_date text, -- Use text for flexible parsing
+ amount text -- Use text to handle "N/A" values
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV' (HEADER)
+LOG ERRORS SEGMENT REJECT LIMIT 10 PERCENT;
+```
+
+### Date format issues
+
+**Error**: `invalid input syntax for type date`
+**Cause**: Inconsistent date formats
+**Solution**: Standardize date formats or use flexible parsing
+
+```sql
+-- Handle multiple date formats in post-processing
+SELECT
+ transaction_id,
+ product_name,
+ CASE
+ WHEN sale_date ~ '^\d{4}-\d{2}-\d{2}$' THEN sale_date::date
+ WHEN sale_date ~ '^\d{2}/\d{2}/\d{4}$' THEN to_date(sale_date,
'MM/DD/YYYY')
+ ELSE NULL
+ END as parsed_date,
+ amount::decimal(10,2)
+FROM sales_flexible
+WHERE sale_date IS NOT NULL;
+```
+
+### Missing or extra columns
+
+**Error**: `extra data after last expected column`
+**Cause**: Inconsistent number of columns
+**Solution**: Use more flexible table definition
+
+```sql
+-- Add extra optional columns to handle variable column counts
+CREATE EXTERNAL TABLE flexible_sales (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2),
+ extra_field1 text, -- Optional fields
+ extra_field2 text,
+ extra_field3 text
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV' (HEADER)
+LOG ERRORS SEGMENT REJECT LIMIT 20 PERCENT;
+```
+
+### Character encoding issues
+
+**Error**: `invalid byte sequence`
+**Cause**: Character encoding mismatch
+**Solution**: Specify correct encoding
+
+```sql
+CREATE EXTERNAL TABLE encoded_data (
+ id int,
+ description text
+)
+LOCATION ('gpfdist://etl-server:8081/data/*.txt')
+FORMAT 'TEXT' (DELIMITER '|')
+ENCODING 'LATIN1' -- Specify encoding
+LOG ERRORS SEGMENT REJECT LIMIT 5 PERCENT;
+```
+
+## Error handling strategies
+
+### Two-phase loading
+
+Use a staging approach for complex data cleaning:
+
+```sql
+-- Phase 1: Load into staging table with flexible types
+CREATE EXTERNAL TABLE sales_staging (
+ transaction_id text,
+ product_name text,
+ sale_date text,
+ amount text,
+ raw_line text -- Store entire row for complex cases
+)
+LOCATION ('gpfdist://etl-server:8081/sales/*.csv')
+FORMAT 'CSV'
+LOG ERRORS SEGMENT REJECT LIMIT 20 PERCENT;
+
+-- Phase 2: Clean and insert into final table
+INSERT INTO sales_final (transaction_id, product_name, sale_date, amount)
+SELECT
+ transaction_id::int,
+ product_name,
+ sale_date::date,
+ amount::decimal(10,2)
+FROM sales_staging
+WHERE transaction_id ~ '^\d+$' -- Validate numeric ID
+ AND sale_date ~ '^\d{4}-\d{2}-\d{2}$' -- Validate date format
+ AND amount ~ '^\d+\.?\d*$'; -- Validate amount format
+```
+
+### Error threshold monitoring
+
+Set up monitoring for error rates:
+
+```sql
+-- Function to check error rate
+CREATE OR REPLACE FUNCTION check_error_rate(table_name text, threshold_percent
numeric)
+RETURNS boolean AS $$
+DECLARE
+ error_count int;
+ total_count int;
+ error_rate numeric;
+BEGIN
+ SELECT COUNT(*) INTO error_count FROM gp_read_error_log(table_name);
+
+ -- Estimate total processed rows (depends on your tracking method)
+ SELECT reltuples INTO total_count FROM pg_class WHERE relname = table_name;
+
+ IF total_count > 0 THEN
+ error_rate := (error_count::numeric / total_count) * 100;
+ RETURN error_rate <= threshold_percent;
+ END IF;
+
+ RETURN true;
+END;
+$$ LANGUAGE plpgsql;
+
+-- Usage
+SELECT check_error_rate('sales_data_logged', 5.0); -- Check if error rate is
under 5%
+```
+
+## Best practices
+
+### Design for errors
+
+1. Begin with permissive table definitions using text data types and generous
error limits to understand the full scope of data quality issues before
implementing strict validation rules.
+2. Implement data validation incrementally by gradually adding constraints and
type conversions as you gain confidence in data quality and identify patterns
in the error logs.
+3. Establish regular monitoring and review processes for error logs to
identify systematic data quality issues and trends that may indicate problems
with source systems or data processing pipelines.
+
+### Error limit guidelines
+
+| Data Quality | Suggested Reject Limit | Use Case |
+|--------------|----------------------|----------|
+| High quality | 1-10 rows | Production systems |
+| Medium quality | 1-5% | Development/testing |
+| Poor quality | 10-20% | Initial data exploration |
+| Unknown quality | 50% | Data discovery phase |
+
+### Operational procedures
+
+1. Establish a regular schedule for cleaning up old error logs to prevent excessive storage consumption and maintain system performance (see the example after this list).
+2. Configure monitoring systems to alert when error rates exceed predefined
thresholds, enabling quick response to data quality issues.
+3. Maintain communication channels with data providers to share error patterns
and collaborate on improving source data quality.
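+
+For example, reviewed error log rows can be cleared per table. This sketch assumes the Greenplum-derived `gp_truncate_error_log()` function is available in your Apache Cloudberry build:
+
+```sql
+-- Clears the error log rows recorded for a single external table.
+SELECT gp_truncate_error_log('sales_data_logged');
+
+-- The wildcard form clears error logs for all external tables you are
+-- allowed to manage in the current database.
+SELECT gp_truncate_error_log('*');
+```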
+
+### Performance considerations
+
+1. Error logging functionality introduces some computational overhead during
data loading operations, so consider this impact when processing large datasets.
+2. Setting higher segment reject limits allows more rows to be processed
before the operation fails, but this also means more resources are consumed
analyzing problematic data.
+3. Error log tables can grow significantly in production environments, so
implement monitoring to track storage usage and prevent disk space issues.
+
+## Troubleshooting
+
+### High error rates
+
+If you are experiencing high error rates:
+
+1. Analyze the error log entries to identify patterns or systematic issues in
your data, such as consistent formatting problems or missing values in specific
columns.
+2. Work with data providers to verify the quality and consistency of source
data, including checking for recent changes in data formats or processing.
+3. Carefully review your external table definitions to ensure that column data
types, delimiters, and format specifications accurately match the actual data
structure.
+4. Start troubleshooting with small data samples to isolate issues quickly
before processing larger datasets.
+
+### Performance issues
+
+If error handling is impacting performance:
+
+1. Fine-tune your segment reject limits to balance between fault tolerance and
processing efficiency, avoiding unnecessarily high thresholds that waste
resources.
+2. For complex data with known quality issues, consider implementing a
two-phase loading process using staging tables with flexible data types.
+3. Break large data loads into smaller, manageable batches to reduce memory
pressure and improve error isolation.
+4. Continuously monitor system resources including memory usage, disk I/O, and
storage consumption during data loading operations.
+
+## Learn more
+
+- [Load Data Using gpfdist](/docs/data-loading/load-data-using-gpfdist.md)
+- [Load Data Using COPY](/docs/data-loading/load-data-using-copy.md)
diff --git a/versioned_docs/version-2.x/data-loading/index.md
b/versioned_docs/version-2.x/data-loading/index.md
index cdf9b96263..7c575533b7 100644
--- a/versioned_docs/version-2.x/data-loading/index.md
+++ b/versioned_docs/version-2.x/data-loading/index.md
@@ -23,9 +23,12 @@ Apache Cloudberry offers multiple data loading solutions,
and you can select dif
| -------------------------- |
----------------------------------------------------------- |
------------------------------------------------------------ | -------- |
| [`copy`](/docs/data-loading/load-data-using-copy.md) |
Local file system<br /><br />• Coordinator node host (for a single file)<br />•
Segment node host (for multiple files) | • TXT<br />• CSV<br />• Binary
| No |
| [`file://` protocol](/docs/data-loading/load-data-using-file-protocol.md)
| Local file system (local segment host, accessible only by superuser) | •
TXT<br />• CSV | Yes |
-| [`gpfdist`](/docs/data-loading/load-data-using-gpfdist.md) | Local host
files or files accessible via internal network | • TXT<br />• CSV<br />• Any
delimited text format supported by the `FORMAT` clause<br />• XML and JSON
(requires conversion to text format via YAML configuration file) | Yes |
|
+| [`gpfdist`](/docs/data-loading/load-data-using-gpfdist.md) | Local host
files or files accessible via internal network | • TXT<br />• CSV<br />• Any
delimited text format supported by the `FORMAT` clause<br />• XML and JSON
(requires conversion to text format via YAML configuration file) | Yes |
+| [`gpfdists`](/docs/data-loading/load-data-using-gpfdists.md) (secure
`gpfdist`) | Local host files or files accessible via internal network (with
SSL encryption) | • TXT<br />• CSV<br />• Any delimited text format supported
by the `FORMAT` clause<br />• XML and JSON (requires conversion to text format
via YAML configuration file) | Yes |
| [Batch loading using `gpload`](/docs/data-loading/load-data-using-gpload.md)
(with `gpfdist` as the underlying worker) | Local host files or files
accessible via internal network | • TXT<br />• CSV<br />• Any delimited text
format supported by the `FORMAT` clause<br />• XML and JSON (require conversion
to text format via YAML configuration file) | Yes |
-| [Creating external web
tables](/docs/data-loading/load-data-from-web-services.md) | Data
pulled from network services or from any source accessible by command lines | •
TXT<br />• CSV | Yes |
+| [`s3://` protocol](/docs/data-loading/load-data-from-s3.md) | Amazon
S3 and S3-compatible object stores | • TXT<br />• CSV<br />• Any delimited text
format supported by the `FORMAT` clause
| Yes |
+| [`pxf://` protocol](/docs/data-loading/load-data-using-pxf.md) |
Object stores (S3, Azure, GCS), Hadoop systems (HDFS, Hive, HBase), SQL
databases | • Text<br />• Avro<br />• JSON<br />• Parquet<br />• ORC<br />• CSV
| Yes |
+| [Create external web
tables](/docs/data-loading/load-data-from-web-services.md) | Data
pulled from network services or from any source accessible by command lines | •
TXT<br />• CSV | Yes |
| [Kafka FDW](/docs/data-loading/load-data-from-kafka-using-fdw.md) |
Streaming data from Apache Kafka | • JSON<br />• CSV
| No |
## Learn more
diff --git a/versioned_docs/version-2.x/data-loading/load-data-from-s3.md
b/versioned_docs/version-2.x/data-loading/load-data-from-s3.md
new file mode 100644
index 0000000000..f9776e151b
--- /dev/null
+++ b/versioned_docs/version-2.x/data-loading/load-data-from-s3.md
@@ -0,0 +1,184 @@
+---
+title: Load Data from Amazon S3
+---
+
+# Load Data from Amazon S3 Using the `s3` Protocol
+
+The `s3` protocol is used in a URL that specifies the location of an Amazon S3
bucket and a prefix to use for reading or writing files in the bucket.
+
+Amazon Simple Storage Service (Amazon S3) provides secure, durable,
highly-scalable object storage. For information about Amazon S3, see [Amazon
S3](https://aws.amazon.com/s3/).
+
+You can define read-only external tables that use existing data files in the
S3 bucket for table data, or writable external tables that store the data from
`INSERT` operations to files in the S3 bucket. Apache Cloudberry uses the S3
URL and prefix specified in the protocol URL either to select one or more files
for a read-only table, or to define the location and filename format to use
when uploading S3 files for `INSERT` operations to writable tables.
+
+The `s3` protocol also supports [Dell Elastic Cloud
Storage](https://www.dell.com/en-us/dt/learn/data-storage/ecs.htm) (ECS), an
Amazon S3 compatible service.
+
+:::note
+The `pxf` protocol can access data in S3 and other object store systems such
as Azure, Google Cloud Storage, and Minio. The `pxf` protocol can also access
data in external Hadoop systems (HDFS, Hive, HBase), and SQL databases. See
[`pxf://` protocol](/docs/data-loading/load-data-using-pxf.md).
+:::
+
+## Configure the s3 protocol
+
+You must configure the `s3` protocol before you can use it. Perform these
steps in each database in which you want to use the protocol:
+
+1. Create the read and write functions for the `s3` protocol library:
+
+ ```sql
+ CREATE OR REPLACE FUNCTION write_to_s3() RETURNS integer AS
+ '$libdir/gps3ext.so', 's3_export' LANGUAGE C STABLE;
+ ```
+
+ ```sql
+ CREATE OR REPLACE FUNCTION read_from_s3() RETURNS integer AS
+ '$libdir/gps3ext.so', 's3_import' LANGUAGE C STABLE;
+ ```
+
+2. Declare the `s3` protocol and specify the read and write functions you
created in the previous step:
+
+ To allow only Apache Cloudberry superusers to use the protocol, create it
as follows:
+
+ ```sql
+ CREATE PROTOCOL s3 (writefunc = write_to_s3, readfunc = read_from_s3);
+ ```
+
+ If you want to permit non-superusers to use the `s3` protocol, create it
as a `TRUSTED` protocol and `GRANT` access to those users. For example:
+
+ ```sql
+ CREATE TRUSTED PROTOCOL s3 (writefunc = write_to_s3, readfunc =
read_from_s3);
+ GRANT ALL ON PROTOCOL s3 TO user1, user2;
+ ```
+
+ :::note
+ The protocol name `s3` must be the same as the protocol of the URL
specified for the external table that you create to access an S3 resource.
+ :::
+
+ The corresponding function is called by every Apache Cloudberry segment
instance.
+
+## Use s3 external tables
+
+Follow these basic steps to use the `s3` protocol with Apache Cloudberry
external tables. Each step includes links to relevant topics from which you can
obtain more information.
+
+1. [Configure the s3 Protocol](#configure-the-s3-protocol).
+2. Create the `s3` protocol configuration file:
+
+ 1. Create a template `s3` protocol configuration file using the
`gpcheckcloud` utility:
+
+ ```shell
+ gpcheckcloud -t > ./mytest_s3.config
+ ```
+
+ 2. (Optional) Edit the template file to specify the `accessid` and
`secret` authentication credentials required to connect to the S3 location.
+
+3. Apache Cloudberry can access an `s3` protocol configuration file when the
file is located on each segment host or when the file is served up by an
`http/https` server. Identify where you plan to locate the configuration file,
and note the location and configuration option (if applicable).
+
+ If you are relying on the AWS credential file to authenticate, this file
must reside at `~/.aws/credentials` on each Apache Cloudberry segment host.
+
+4. Use the `gpcheckcloud` utility to validate connectivity to the S3 bucket.
You must specify the S3 endpoint name and bucket that you want to check.
+
+ For example, if the `s3` protocol configuration file resides in the
default location, you would run the following command:
+
+ ```shell
+ gpcheckcloud -c "s3://<s3-endpoint>/<s3-bucket>"
+ ```
+
+ `gpcheckcloud` attempts to connect to the S3 endpoint and lists any files
in the S3 bucket, if available. A successful connection ends with the message:
+
+ ```
+ Your configuration works well.
+ ```
+
+ You can optionally use `gpcheckcloud` to validate uploading to and
downloading from the S3 bucket.
+
+5. Create an s3 external table by specifying an `s3` protocol URL in the `LOCATION` clause of the `CREATE EXTERNAL TABLE` command.
+
+### Create a readable S3 external table
+
+For reading data from S3, specify the S3 location and file pattern:
+
+```sql
+CREATE EXTERNAL TABLE sales_data (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('s3://s3-us-west-2.amazonaws.com/your-bucket/sales-data/
config=/path/to/s3.config')
+FORMAT 'CSV' (HEADER);
+```
+
+### Create a writable S3 external table
+
+For writing data to S3:
+
+```sql
+CREATE WRITABLE EXTERNAL TABLE sales_export (LIKE sales_data)
+LOCATION ('s3://s3-us-west-2.amazonaws.com/your-bucket/exports/
config=/path/to/s3.config')
+FORMAT 'CSV'
+DISTRIBUTED BY (transaction_id);
+```
+
+For read-only s3 tables, the URL defines the location and prefix used to select existing data files that comprise the s3 table. For example:
+
+```sql
+CREATE READABLE EXTERNAL TABLE S3TBL (date text, time text, amt int)
+LOCATION('s3://s3-us-west-2.amazonaws.com/s3test.example.com/dataset1/normal/ config=/home/gpadmin/aws_s3/s3.conf')
+FORMAT 'csv';
+```
+
+For writable s3 tables, the protocol URL defines the S3 location in which Apache Cloudberry writes the data files that back the table for `INSERT` operations. You can also specify a prefix that Apache Cloudberry will add to the files that it creates. For example:
+
+```sql
+CREATE WRITABLE EXTERNAL TABLE S3WRIT (LIKE S3TBL)
+LOCATION('s3://s3-us-west-2.amazonaws.com/s3test.example.com/dataset1/normal/ config=/home/gpadmin/aws_s3/s3.conf')
+FORMAT 'csv';
+```
+
+## About the s3 protocol LOCATION URL
+
+When you use the `s3` protocol, you specify an S3 file location and optional
configuration file location and region parameters in the `LOCATION` clause of
the `CREATE EXTERNAL TABLE` command. The syntax follows:
+
+```
+'s3://<S3_endpoint>[:<port>]/<bucket_name>/[<S3_prefix>] [region=<S3_region>]
[config=<config_file_location> | config_server=<url>] [section=<section_name>]'
+```
+
+The `s3` protocol requires that you specify the S3 endpoint and S3 bucket
name. Each Apache Cloudberry segment host must have access to the S3 location.
The optional S3_prefix value is used to select files for read-only S3 tables,
or as a filename prefix to use when uploading files for s3 writable tables.
+
+:::note
+The Apache Cloudberry `s3` protocol URL must include the S3 endpoint hostname.
+:::
+
+To specify an ECS endpoint (an Amazon S3 compatible service) in the `LOCATION`
clause, you must set the `s3` protocol configuration file parameter `version`
to `2`. The `version` parameter controls whether the `region` parameter is used
in the `LOCATION` clause. You can also specify an Amazon S3 location when the
`version` parameter is 2.
+
+:::note
+Although the S3_prefix is an optional part of the syntax, you should always include an S3 prefix for both writable and read-only s3 tables so that different datasets are kept separate.
+:::
+
+For writable s3 tables, the `s3` protocol URL specifies the endpoint and
bucket name where Apache Cloudberry uploads data files for the table. The S3
file prefix is used for each new file uploaded to the S3 location as a result
of inserting data to the table.
+
+For read-only s3 tables, the S3 file prefix is optional. If you specify an
S3_prefix, then the `s3` protocol selects all files that start with the
specified prefix as data files for the external table. The `s3` protocol does
not use the slash character (`/`) as a delimiter, so a slash character
following a prefix is treated as part of the prefix itself.
+
+For example, consider the following 5 files that each have the S3_endpoint
named `s3-us-west-2.amazonaws.com` and the bucket_name `test1`:
+
+```
+s3://s3-us-west-2.amazonaws.com/test1/abc
+s3://s3-us-west-2.amazonaws.com/test1/abc/
+s3://s3-us-west-2.amazonaws.com/test1/abc/xx
+s3://s3-us-west-2.amazonaws.com/test1/abcdef
+s3://s3-us-west-2.amazonaws.com/test1/abcdefff
+```
+
+- If the S3 URL is provided as `s3://s3-us-west-2.amazonaws.com/test1/abc`,
then the `abc` prefix selects all 5 files.
+- If the S3 URL is provided as `s3://s3-us-west-2.amazonaws.com/test1/abc/`,
then the `abc/` prefix selects the files
`s3://s3-us-west-2.amazonaws.com/test1/abc/` and
`s3://s3-us-west-2.amazonaws.com/test1/abc/xx`.
+- If the S3 URL is provided as `s3://s3-us-west-2.amazonaws.com/test1/abcd`,
then the `abcd` prefix selects the files
`s3://s3-us-west-2.amazonaws.com/test1/abcdef` and
`s3://s3-us-west-2.amazonaws.com/test1/abcdefff`.
+
+Wildcard characters are not supported in an S3_prefix; however, the S3 prefix
functions as if a wildcard character immediately followed the prefix itself.
+
+All of the files selected by the S3 URL (S3_endpoint/bucket_name/S3_prefix)
are used as the source for the external table, so they must have the same
format. Each file must also contain complete data rows. A data row cannot be
split between files.
+
+You use the `config` or `config_server` parameter to specify the location of
the required `s3` protocol configuration file that contains AWS connection
credentials and communication parameters.
+
+Use the `section` parameter to specify the name of the configuration file
section from which the `s3` protocol reads configuration parameters. The
default `section` is named `default`. When you specify the section name in the
configuration file, enclose it in brackets (for example, `[default]`).
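+
+For example, a single configuration file can hold several named credential sections. The `accessid` and `secret` parameter names come from the template generated by `gpcheckcloud -t`; the `[archive]` section name and file path below are illustrative assumptions:
+
+```ini
+[default]
+accessid = "<primary_access_key_id>"
+secret = "<primary_secret_access_key>"
+
+[archive]
+accessid = "<archive_access_key_id>"
+secret = "<archive_secret_access_key>"
+```
+
+An external table could then select the second set of credentials by adding `config=/home/gpadmin/s3.conf section=archive` to its `LOCATION` URL.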
+
+## Learn more
+
+- [Amazon S3 Documentation](https://aws.amazon.com/documentation/s3/)
+- [AWS S3
Endpoints](https://docs.aws.amazon.com/general/latest/gr/rande.html#s3_region)
diff --git
a/versioned_docs/version-2.x/data-loading/load-data-using-gpfdists.md
b/versioned_docs/version-2.x/data-loading/load-data-using-gpfdists.md
new file mode 100644
index 0000000000..4df834f474
--- /dev/null
+++ b/versioned_docs/version-2.x/data-loading/load-data-using-gpfdists.md
@@ -0,0 +1,262 @@
+---
+title: Load Data Using gpfdists (Secure)
+---
+
+# Load Data Securely Using `gpfdists`
+
+The `gpfdists` protocol is a secure version of the `gpfdist` protocol that
enables encrypted communication between Apache Cloudberry and the gpfdist file
server. When you use `gpfdists`, all data transfer is encrypted using SSL,
protecting against eavesdropping and man-in-the-middle attacks.
+
+`gpfdists` provides the same high-performance parallel data loading
capabilities as `gpfdist`, but with additional security features essential for
production environments handling sensitive data.
+
+## Security features
+
+- All data transmitted between Apache Cloudberry segments and gpfdist servers
is encrypted using SSL/TLS protocols, protecting against eavesdropping and data
interception.
+- Mutual authentication is enforced through client certificates, ensuring that
both Apache Cloudberry and gpfdist servers verify each other's identities
before establishing connections.
+- The implementation uses TLSv1 protocol with AES_128_CBC_SHA encryption
algorithm to provide strong cryptographic protection for data in transit.
+- Secure server identification mechanisms prevent unauthorized systems from
masquerading as legitimate gpfdist servers, protecting against
man-in-the-middle attacks.
+
+## Before you begin
+
+To use `gpfdists`, make sure that:
+
+- SSL certificates are configured on all segment hosts.
+- The gpfdist utility is available on the file server host.
+- Network connectivity exists between the segment hosts and the gpfdist server.
+- The appropriate SSL certificate files are in the correct locations.
+
+## Step 1. Set up SSL certificates
+
+### Required certificate files
+
+The following certificate files must be present in the `$PGDATA/gpfdists`
directory on each Apache Cloudberry segment host:
+
+#### For full SSL authentication (recommended):
+
+- `client.key` - Client private key file
+- `client.crt` - Client certificate file
+- `root.crt` - Trusted certificate authorities file
+
+#### Certificate requirements by configuration:
+
+| verify_gpfdists_cert | --ssl_verify_peer | Required Certificate Files |
+|---------------------|-------------------|---------------------------|
+| on (default) | on (default) | `client.key`, `client.crt`, `root.crt` |
+| on | off | `root.crt` |
+| off | on | `client.key`, `client.crt` |
+| off | off | None |
+
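+### Generate test certificates (optional)
+
+If you do not yet have certificates, you can create a throwaway certificate authority and client certificate with `openssl` for testing. The commands below are a sketch for lab environments only; production deployments should use certificates issued by a trusted certificate authority:
+
+```shell
+# Creates a self-signed test CA (root.crt and root.key).
+openssl req -new -x509 -days 365 -nodes -subj "/CN=test-ca" \
+    -keyout root.key -out root.crt
+
+# Creates a client key and a certificate signing request.
+openssl req -new -nodes -subj "/CN=cloudberry-segment" \
+    -keyout client.key -out client.csr
+
+# Signs the client certificate with the test CA.
+openssl x509 -req -in client.csr -CA root.crt -CAkey root.key \
+    -CAcreateserial -days 365 -out client.crt
+```
+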
+### Install certificates
+
+1. Create the gpfdists directory on each segment host:
+
+ ```shell
+ mkdir -p $PGDATA/gpfdists
+ ```
+
+2. Copy the certificate files to each segment host:
+
+ ```shell
+ # Copy to all segment hosts
+ scp client.key client.crt root.crt gpadmin@segment-host:$PGDATA/gpfdists/
+ ```
+
+3. Set appropriate permissions:
+
+ ```shell
+ chmod 600 $PGDATA/gpfdists/client.key
+ chmod 644 $PGDATA/gpfdists/client.crt
+ chmod 644 $PGDATA/gpfdists/root.crt
+ ```
+
+## Step 2. Start gpfdist with SSL
+
+Start the gpfdist utility with the `--ssl` option to enable secure connections:
+
+```shell
+gpfdist -p 8081 -d /data/load_files --ssl /path/to/certificates &
+```
+
+### SSL options for gpfdist
+
+- `--ssl <certificates_path>`: Enables SSL and specifies the certificate directory
+- `--ssl_verify_peer on|off`: Controls peer verification (default: on)
+
+### Example: Start multiple secure gpfdist instances
+
+```shell
+# Starts the first secure gpfdist instance.
+gpfdist -d /var/load_files1 -p 8081 --ssl /home/gpadmin/certs \
+ --ssl_verify_peer on -l /home/gpadmin/log1 &
+
+# Starts the second secure gpfdist instance.
+gpfdist -d /var/load_files2 -p 8082 --ssl /home/gpadmin/certs \
+ --ssl_verify_peer on -l /home/gpadmin/log2 &
+```
+
+## Step 3. Create external tables with gpfdists
+
+Use the `gpfdists://` protocol in the `LOCATION` clause to create secure
external tables:
+
+### Readable external table
+
+```sql
+CREATE EXTERNAL TABLE secure_sales_data (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('gpfdists://etl-server1:8081/sales/*.txt',
+ 'gpfdists://etl-server2:8082/sales/*.txt')
+FORMAT 'TEXT' (DELIMITER '|' NULL ' ');
+```
+
+### Writable external table
+
+```sql
+CREATE WRITABLE EXTERNAL TABLE secure_export (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('gpfdists://etl-server1:8081/exports/sales_data.txt')
+FORMAT 'TEXT' (DELIMITER '|')
+DISTRIBUTED BY (transaction_id);
+```
+
+### With error handling
+
+```sql
+CREATE EXTERNAL TABLE secure_data_with_errors (
+ id int,
+ name text,
+ value decimal(10,2)
+)
+LOCATION ('gpfdists://etl-server:8081/data/*.csv')
+FORMAT 'CSV' (HEADER)
+LOG ERRORS SEGMENT REJECT LIMIT 100;
+```
+
+## Configuration parameters
+
+### Apache Cloudberry parameters
+
+Configure these parameters in `postgresql.conf`:
+
+```ini
+# Enables/disables SSL certificate verification (default: on)
+verify_gpfdists_cert = on
+
+# Control segment parallelism
+gp_external_max_segs = 64
+```
+
+### gpfdist SSL parameters
+
+The gpfdist utility supports these SSL-related options:
+
+| Parameter | Description | Default |
+|-----------|-------------|---------|
+| `--ssl` | Enable SSL and specify certificate path | Disabled |
+| `--ssl_verify_peer` | Verify client certificates | on |
+
+## Security best practices
+
+### Certificate management
+
+- Generate all certificates from a trusted certificate authority to ensure
proper validation and trust chain establishment.
+- Implement a regular certificate rotation schedule to enhance security and
prevent issues from certificate expiration.
+- Store private keys in secure locations with restricted access permissions,
ensuring only authorized personnel can access them.
+- Maintain secure backup copies of all certificate files to enable quick
recovery in case of system failures or corruption.
+
+### Network security
+
+- Configure firewall rules to restrict access to gpfdists ports, allowing only
authorized Apache Cloudberry segment hosts to connect.
+- Use secure network connections such as VPNs or private networks to prevent
unauthorized access to data transmission channels.
+- Implement continuous monitoring of SSL connections and certificate
expiration dates to proactively address security issues.
+
+### Access control
+
+- Apply the principle of least privilege by granting only the minimum
permissions necessary for users and applications to perform their required
functions.
+- Implement robust authentication mechanisms including multi-factor
authentication where appropriate to verify user identities.
+- Enable comprehensive audit logging to track all access attempts, successful
connections, and security-related events for compliance and security monitoring.
+
+## Troubleshooting
+
+### SSL connection errors
+
+Check certificate configuration:
+
+```shell
+# Verifies certificate files exist.
+ls -la $PGDATA/gpfdists/
+
+# Checks certificate validity.
+openssl x509 -in $PGDATA/gpfdists/client.crt -text -noout
+```
+
+### Troubleshoot certificate verification issues
+
+1. Verify that the `root.crt` file contains the correct certificate authority
chain and that all intermediate certificates are properly included for
validation.
+2. Check certificate expiration dates using tools like `openssl x509 -dates`
to ensure that certificates have not expired and plan for renewal well in
advance.
+3. Validate that the private key file `client.key` corresponds exactly to the public certificate in `client.crt` using certificate validation tools, as shown in the sketch below.
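+
+A minimal sketch of these checks with standard `openssl` commands, run from the `$PGDATA/gpfdists` directory:
+
+```shell
+# Shows the validity window of the client certificate.
+openssl x509 -in client.crt -noout -dates
+
+# Verifies the client certificate against the trusted CA chain.
+openssl verify -CAfile root.crt client.crt
+
+# Confirms that an RSA private key matches the certificate
+# (the two digests must be identical).
+openssl x509 -noout -modulus -in client.crt | openssl md5
+openssl rsa  -noout -modulus -in client.key | openssl md5
+```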
+
+### Common error messages
+
+| Error | Cause | Solution |
+|-------|-------|----------|
+| "SSL certificate verify failed" | Invalid or expired certificate | Check
certificate validity and CA |
+| "SSL handshake failed" | SSL configuration mismatch | Verify SSL settings on
both sides |
+| "Permission denied" | Incorrect file permissions | Set proper permissions on
certificate files |
+
+### Debug SSL connections
+
+Enable verbose logging:
+
+```shell
+gpfdist -d /data -p 8081 --ssl /certs -V --ssl_verify_peer on
+```
+
+## Performance considerations
+
+- SSL encryption introduces computational overhead during data transmission,
which may reduce overall throughput compared to unencrypted connections,
especially for large data transfers.
+- Certificate caching mechanisms help reduce the performance impact of SSL
handshakes by reusing established secure connections across multiple data
transfer operations.
+- Deploy multiple gpfdists instances across different hosts or network
interfaces to distribute the load and achieve better aggregate throughput for
concurrent data operations.
+- Ensure that your network infrastructure provides adequate bandwidth and low
latency between Apache Cloudberry segments and gpfdist servers to minimize
performance bottlenecks.
+
+## Migration from gpfdist to gpfdists
+
+To migrate existing gpfdist usage to gpfdists:
+
+- Install and configure the required SSL certificates on all Apache Cloudberry
segment hosts, ensuring proper permissions and certificate chain validation.
+- Update all external table definitions to change protocol specifications from
`gpfdist://` to `gpfdists://` in their LOCATION clauses.
+- Restart your gpfdist file servers with the `--ssl` option enabled,
specifying the appropriate certificate directories and SSL verification
settings.
+- Thoroughly test secure connections from all Apache Cloudberry segments to
verify that data loading operations work correctly with the new encrypted
protocol.
+- Update any automation scripts, monitoring tools, and operational procedures
to account for the new secure protocol requirements and certificate management
tasks.
+
+### Example migration
+
+Before (insecure):
+
+```sql
+LOCATION ('gpfdist://etl-server:8081/data.txt')
+```
+
+After (secure):
+
+```sql
+LOCATION ('gpfdists://etl-server:8081/data.txt')
+```
+
+## Limitations
+
+- Cannot mix `gpfdist://` and `gpfdists://` protocols in the same external
table definition.
+- All Apache Cloudberry segment hosts must have properly configured SSL
certificates and trust relationships.
+- SSL encryption introduces computational overhead that may reduce data
transfer throughput compared to unencrypted gpfdist.
+- Ongoing certificate lifecycle management is required, including renewal,
rotation, and revocation processes.
+
+## Learn more
+
+- [Load Data Using gpfdist](/docs/data-loading/load-data-using-gpfdist.md)
+- [SSL Certificate Management](https://www.sslshopper.com/ssl-converter.html)
diff --git a/versioned_docs/version-2.x/data-loading/load-data-using-pxf.md
b/versioned_docs/version-2.x/data-loading/load-data-using-pxf.md
new file mode 100644
index 0000000000..dbac9b7b42
--- /dev/null
+++ b/versioned_docs/version-2.x/data-loading/load-data-using-pxf.md
@@ -0,0 +1,327 @@
+---
+title: Load Data Using PXF
+---
+
+# Load Data Using the Platform Extension Framework (PXF)
+
+The Apache Cloudberry Platform Extension Framework (PXF) is an extensible
framework that allows Apache Cloudberry to query external data files whose
metadata is not managed by the database. You can use its `pxf://` protocol to
access data residing in object store systems (Azure, Google Cloud Storage,
Minio, S3), external Hadoop systems (HDFS, Hive, HBase), and SQL databases.
+
+PXF includes built-in connectors for accessing data that exists inside HDFS
files, Hive tables, HBase tables, and JDBC-accessible databases. Users can also
create their own connectors to other data storage or processing engines.
+
+The version of PXF used in Apache Cloudberry is forked from the Greenplum PXF
project and has been specifically adapted for it. The source code for this
adapted version is hosted by the Apache Software Foundation at
[apache/cloudberry-pxf](https://github.com/apache/cloudberry-pxf).
+
+The PXF `pxf` protocol is packaged as an Apache Cloudberry extension that
supports both reading from and writing to external data stores. The framework
includes a C-language extension and a Java service. After you configure and
initialize PXF, you start a single PXF JVM process on each Apache Cloudberry
segment host. This long-running process concurrently serves multiple query
requests.
+
+Before using the `pxf` protocol, you must explicitly initialize and start the
PXF service. You must also enable PXF in each database where it will be used
and grant permissions on the `pxf` protocol to the relevant Apache Cloudberry
users.
+
+## Install and Build PXF
+
+PXF is a component that must be built and installed separately. The source
code and detailed instructions for building, installing, and developing PXF are
available in the official `apache/cloudberry-pxf` repository. Before proceeding
with the configuration, follow the development guide in the repository to set
up PXF.
+
+**Build and Installation Guide**: [apache/cloudberry-pxf
README](https://github.com/apache/cloudberry-pxf/blob/main/README.md)
+
+## Configure PXF
+
+Before you can use PXF to access external data, you must configure and
initialize the service.
+
+### Initialize and Start PXF
+
+To use PXF, you must first initialize it. This process creates the necessary
configuration directory structure and files on each Apache Cloudberry segment
host. After initialization, you need to start the PXF service.
+
+1. Add the PXF binary directory to your `PATH`. This step ensures that you
can run `pxf` commands from any location.
+
+ ```shell
+ export PATH=/usr/local/pxf/bin:$PATH
+ ```
+
+ You can add this line to your `.bashrc` or `.zshrc` file to make it
permanent.
+
+2. Initialize PXF. The `prepare` command sets up the PXF configuration. This
command only needs to be run once.
+
+ ```shell
+ pxf prepare
+ ```
+
+3. Start the PXF service. This command starts a Java process on each segment
host that acts as the external data coordinator.
+
+ ```shell
+ pxf start
+ ```
+
+### Enable PXF in a database
+
+You must enable PXF in each database in which you want to use the `pxf`
protocol to access external data.
+
+To enable PXF in a database, you must create the PXF extension in the database:
+
+```sql
+CREATE EXTENSION pxf;
+```
+
+The `pxf` protocol is packaged as an extension with Apache Cloudberry, and you
must explicitly enable it in each database in which you plan to use it.
+
+### Grant access to PXF
+
+To allow non-superuser Apache Cloudberry roles to create external tables using
the `pxf` protocol, you must grant `SELECT` privileges on the `pxf` protocol to
each role:
+
+```sql
+GRANT SELECT ON PROTOCOL pxf TO <role_name>;
+```
+
+To allow non-superuser Apache Cloudberry roles to create writable external
tables using the `pxf` protocol, you must grant `INSERT` privileges on the
`pxf` protocol to each role:
+
+```sql
+GRANT INSERT ON PROTOCOL pxf TO <role_name>;
+```
+
+## PXF configuration overview
+
+PXF configuration includes server configuration and connector configuration. A
PXF *server* is a named configuration that provides access credentials and
other information required to access an external data source. A PXF *connector*
is the interface between PXF and the external data source.
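+
+In upstream Greenplum PXF, a named server is a directory of configuration files that you synchronize to all hosts. The sketch below assumes the Cloudberry fork keeps the same `$PXF_BASE/servers/<server_name>/` layout and `pxf cluster sync` command, and the file name `s3-site.xml` is likewise an upstream convention; the external table examples later on this page reference this server as `SERVER=s3-server`:
+
+```shell
+# Creates a named server called "s3-server" on the host where you run pxf commands.
+mkdir -p $PXF_BASE/servers/s3-server
+
+# Places the connector configuration (such as the s3-site.xml shown below) into it.
+cp s3-site.xml $PXF_BASE/servers/s3-server/
+
+# Copies the server configuration to every Apache Cloudberry host.
+pxf cluster sync
+```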
+
+### Example: configure S3 connector
+
+Create a PXF configuration for accessing S3 data:
+
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<configuration>
+ <property>
+ <name>fs.s3a.access.key</name>
+ <value>YOUR_ACCESS_KEY</value>
+ </property>
+ <property>
+ <name>fs.s3a.secret.key</name>
+ <value>YOUR_SECRET_KEY</value>
+ </property>
+ <property>
+ <name>fs.s3a.endpoint</name>
+ <value>s3.amazonaws.com</value>
+ </property>
+</configuration>
+```
+
+### Example: configure JDBC connector
+
+For accessing SQL databases:
+
+```xml
+<?xml version="1.0" encoding="UTF-8"?>
+<configuration>
+ <property>
+ <name>jdbc.driver</name>
+ <value>org.postgresql.Driver</value>
+ </property>
+ <property>
+ <name>jdbc.url</name>
+ <value>jdbc:postgresql://hostname:5432/database</value>
+ </property>
+ <property>
+ <name>jdbc.user</name>
+ <value>username</value>
+ </property>
+ <property>
+ <name>jdbc.password</name>
+ <value>password</value>
+ </property>
+</configuration>
+```
+
+## Create PXF external tables
+
+PXF external tables use the `pxf://` protocol in the `LOCATION` clause. The
URL format varies depending on the data source and connector.
+
+### Read from HDFS
+
+Access text files stored in HDFS:
+
+```sql
+CREATE EXTERNAL TABLE sales_hdfs (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('pxf://hdfs-namenode:8020/data/sales/sales.txt?PROFILE=hdfs:text')
+FORMAT 'TEXT' (DELIMITER '|');
+```
+
+### Read from Hive tables
+
+Access Hive tables directly:
+
+```sql
+CREATE EXTERNAL TABLE hive_sales (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('pxf://hive-metastore:9083/sales_db.sales_table?PROFILE=hive')
+FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+```
+
+### Read from Parquet files
+
+Access Parquet files in object storage:
+
+```sql
+CREATE EXTERNAL TABLE parquet_data (
+ id bigint,
+ name text,
+ created_date date
+)
+LOCATION
('pxf://s3a://my-bucket/data/events.parquet?PROFILE=s3:parquet&SERVER=s3-server')
+FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+```
+
+### Read from SQL databases
+
+Query external SQL databases:
+
+```sql
+CREATE EXTERNAL TABLE external_customers (
+ customer_id int,
+ customer_name text,
+ email text,
+ registration_date date
+)
+LOCATION
('pxf://postgresql-server/customers?PROFILE=jdbc&SERVER=postgres-server')
+FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+```
+
+### Write data with PXF
+
+Create writable external tables to export data:
+
+```sql
+CREATE WRITABLE EXTERNAL TABLE export_to_hdfs (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('pxf://hdfs-namenode:8020/exports/sales_export?PROFILE=hdfs:text')
+FORMAT 'TEXT' (DELIMITER '|')
+DISTRIBUTED BY (transaction_id);
+```
+
+## PXF URL format
+
+The PXF protocol URL follows this syntax:
+
+```
+pxf://<host>[:<port>]/<path-to-data>?PROFILE=<profile_name>[&<custom-option>=<value>][&SERVER=<server_name>]
+```
+
+Where:
+
+- `host:port`: Location of the external data source
+- `path-to-data`: Path to the specific data (file, directory, table, etc.)
+- `PROFILE`: PXF connector profile (e.g., `hdfs:text`, `hive`, `s3:parquet`,
`jdbc`)
+- `SERVER`: Named server configuration (optional)
+- `custom-option`: Additional connector-specific options
+
+## Common PXF profiles
+
+| Profile | Data Source | Format | Use Case |
+|---------|-------------|--------|----------|
+| `hdfs:text` | HDFS | Text files | Delimited text data |
+| `hdfs:avro` | HDFS | Avro files | Schema evolution support |
+| `hdfs:parquet` | HDFS | Parquet files | Columnar analytics |
+| `hdfs:orc` | HDFS | ORC files | Optimized row columnar |
+| `hive` | Hive | Various | Hive table access |
+| `hbase` | HBase | HBase | NoSQL data access |
+| `s3:text` | S3 | Text files | Cloud object storage |
+| `s3:parquet` | S3 | Parquet files | Cloud analytics |
+| `jdbc` | SQL Database | Various | External database access |
+
+## Performance optimization
+
+### Partition pruning
+
+PXF supports partition pruning for Hive tables:
+
+```sql
+SELECT * FROM hive_sales
+WHERE sale_date >= '2024-01-01' AND sale_date < '2024-02-01';
+```
+
+### Predicate pushdown
+
+Enable predicate pushdown for better performance:
+
+```sql
+-- This filter can be pushed down to the external source
+SELECT * FROM external_customers
+WHERE registration_date > '2024-01-01';
+```
+
+### Parallel processing
+
+Leverage multiple files for better parallelism:
+
+```sql
+CREATE EXTERNAL TABLE multi_file_data (
+ id bigint,
+ data text
+)
+LOCATION ('pxf://hdfs-namenode:8020/data/partitioned/*?PROFILE=hdfs:text')
+FORMAT 'TEXT' (DELIMITER ',');
+```
+
+## Error handling
+
+Enable error logging for data quality monitoring:
+
+```sql
+CREATE EXTERNAL TABLE sales_with_errors (
+ transaction_id int,
+ product_name text,
+ sale_date date,
+ amount decimal(10,2)
+)
+LOCATION ('pxf://hdfs-namenode:8020/data/sales/?PROFILE=hdfs:text')
+FORMAT 'TEXT' (DELIMITER '|')
+LOG ERRORS SEGMENT REJECT LIMIT 50;
+```
+
+## Best practices
+
+- Select the most specific PXF profile that matches your data format and
storage system to ensure optimal performance and feature support.
+- Create reusable named server configurations to centralize connection
parameters and credentials, making it easier to manage multiple external data
sources.
+- Structure your external data using file sizes between 100MB and 1GB to
achieve optimal parallel processing across Apache Cloudberry segments.
+- Leverage compressed storage formats such as Parquet or ORC to reduce I/O
overhead and improve query performance.
+- Implement comprehensive error logging and monitoring in production
environments to quickly identify and resolve data access issues.
+- Establish proper authentication mechanisms and enable encryption for all
connections to sensitive external data sources.
+
+## Limitations
+
+- PXF requires the external data source to be accessible from all segment
hosts.
+- Some data sources may have specific version compatibility requirements.
+- Write operations depend on the connector's capabilities.
+- Performance can be affected by network latency to external sources.
+
+## Troubleshooting
+
+### Check PXF service status
+
+```shell
+pxf cluster status
+```
+
+### View PXF logs
+
+```shell
+pxf cluster logs
+```
+
+### Test connectivity
+
+```sql
+SELECT * FROM pxf_external_table LIMIT 5;
+```
+
+## Learn more
+
+For more details about the Apache Cloudberry Platform Extension Framework
(PXF), please refer to the official PXF project repository, which contains the
source code, documentation, and contribution guidelines.
+
+- [Apache Cloudberry PXF Project](https://github.com/apache/cloudberry-pxf)
diff --git
a/versioned_docs/version-2.x/developer/write-a-foreign-data-wrapper.md
b/versioned_docs/version-2.x/developer/write-a-foreign-data-wrapper.md
new file mode 100644
index 0000000000..92140fd353
--- /dev/null
+++ b/versioned_docs/version-2.x/developer/write-a-foreign-data-wrapper.md
@@ -0,0 +1,411 @@
+---
+title: Write a Foreign Data Wrapper
+---
+
+# Write a Foreign Data Wrapper
+
+This guide outlines how to write a new foreign data wrapper (FDW) for Apache
Cloudberry. A foreign data wrapper is a library that consists of a set of
functions that the Apache Cloudberry server calls to access external data
sources. The FDW is responsible for fetching data from remote data stores and
returning it to the Apache Cloudberry executor.
+
+FDWs enable Apache Cloudberry to treat external data sources as if they were
regular database tables, allowing you to query external data using standard SQL.
+
+## Before you begin
+
+### Requirements
+
+When developing with the Apache Cloudberry foreign data wrapper API:
+
+- Your development system must have the same hardware and software
architecture as your Apache Cloudberry hosts
+- Code must be written in a compiled language such as C, using the version-1
interface
+- Symbol names in your object files must not conflict with each other or with
symbols defined in the Apache Cloudberry server
+- You should be familiar with the foreign table concepts described in
[Accessing External Data with Foreign Tables](/docs/external/g-foreign.html)
+
+### Header files
+
+The Apache Cloudberry header files for FDW development are located in
`$GPHOME/include/postgresql/server/`:
+
+- `foreign/fdwapi.h` - FDW API structures and callback function signatures
+- `foreign/foreign.h` - Foreign data wrapper helper structs and functions
+- `catalog/pg_foreign_table.h` - Foreign table definition
+- `catalog/pg_foreign_server.h` - Foreign server definition
+
+## FDW architecture
+
+An FDW consists of two main components:
+
+1. A handler function that returns a struct containing function pointers to
all the callback functions needed by Apache Cloudberry to interact with the
external data source.
+2. An optional validator function that validates configuration options
provided in `CREATE` and `ALTER` commands for the foreign data wrapper,
servers, user mappings, and foreign tables.
+
+### Handler function
+
+The handler function is SQL-invokable and returns a struct containing pointers
to callback functions:
+
+```c
+CREATE FUNCTION my_fdw_handler()
+ RETURNS fdw_handler
+ AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+```
+
+### Validator function
+
+The validator function validates options for FDW objects:
+
+```c
+CREATE FUNCTION my_fdw_validator(text[], oid)
+ RETURNS void
+ AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+```
+
+## Callback functions
+
+The FDW API defines callback functions that Apache Cloudberry invokes when
scanning and updating foreign tables. The handler function returns a
`FdwRoutine` struct containing pointers to these functions.
+
+### Required scan-related callbacks
+
+These functions are required for all FDWs:
+
+#### GetForeignRelSize
+
+```c
+void GetForeignRelSize(PlannerInfo *root,
+ RelOptInfo *baserel,
+ Oid foreigntableid)
+```
+
+Obtain relation size estimates for a foreign table. Called at the beginning of
planning.
+
+#### GetForeignPaths
+
+```c
+void GetForeignPaths(PlannerInfo *root,
+ RelOptInfo *baserel,
+ Oid foreigntableid)
+```
+
+Create possible access paths for a scan on a foreign table. Must call
`create_foreignscan_path()` for Apache Cloudberry compatibility.
+
+#### GetForeignPlan
+
+```c
+ForeignScan *GetForeignPlan(PlannerInfo *root,
+ RelOptInfo *baserel,
+ Oid foreigntableid,
+ ForeignPath *best_path,
+ List *tlist,
+ List *scan_clauses)
+```
+
+Create a ForeignScan plan node from the selected foreign access path.
+
+#### BeginForeignScan
+
+```c
+void BeginForeignScan(ForeignScanState *node,
+ int eflags)
+```
+
+Begin executing a foreign scan. Called during executor startup.
+
+#### IterateForeignScan
+
+```c
+TupleTableSlot *IterateForeignScan(ForeignScanState *node)
+```
+
+Fetch one row from the foreign source. Return NULL when no more rows are
available.
+
+#### ReScanForeignScan
+
+```c
+void ReScanForeignScan(ForeignScanState *node)
+```
+
+Restart the scan from the beginning.
+
+#### EndForeignScan
+
+```c
+void EndForeignScan(ForeignScanState *node)
+```
+
+End the scan and release resources.
+
+### Optional update-related callbacks
+
+If your FDW supports write operations, implement these functions:
+
+#### ExecForeignInsert
+
+```c
+TupleTableSlot *ExecForeignInsert(EState *estate,
+ ResultRelInfo *rinfo,
+ TupleTableSlot *slot,
+ TupleTableSlot *planSlot)
+```
+
+Insert a single tuple into the foreign table.
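+
+A sketch of the general shape; `my_send_row_to_remote()` is a hypothetical helper that forwards one row to the external system:
+
+```c
+#include "executor/tuptable.h"   /* slot_getallattrs() */
+
+/* Hypothetical helper, not shown. */
+static void my_send_row_to_remote(ResultRelInfo *rinfo,
+                                  Datum *values, bool *isnull);
+
+static TupleTableSlot *
+myExecForeignInsert(EState *estate,
+                    ResultRelInfo *rinfo,
+                    TupleTableSlot *slot,
+                    TupleTableSlot *planSlot)
+{
+    /* Deform the tuple so every attribute is available in the slot. */
+    slot_getallattrs(slot);
+
+    my_send_row_to_remote(rinfo, slot->tts_values, slot->tts_isnull);
+
+    /* Return the slot so RETURNING clauses can see the inserted row. */
+    return slot;
+}
+```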
+
+#### ExecForeignUpdate
+
+```c
+TupleTableSlot *ExecForeignUpdate(EState *estate,
+ ResultRelInfo *rinfo,
+ TupleTableSlot *slot,
+ TupleTableSlot *planSlot)
+```
+
+Update a single tuple in the foreign table.
+
+#### ExecForeignDelete
+
+```c
+TupleTableSlot *ExecForeignDelete(EState *estate,
+ ResultRelInfo *rinfo,
+ TupleTableSlot *slot,
+ TupleTableSlot *planSlot)
+```
+
+Delete a single tuple from the foreign table.
+
+## Apache Cloudberry considerations
+
+### The mpp_execute option
+
+Apache Cloudberry supports parallel execution through the `mpp_execute`
option. Your FDW should handle this option to determine where to request or
send data:
+
+```c
+ForeignTable *table = GetForeignTable(foreigntableid);
+if (table->exec_location == FTEXECLOCATION_ALL_SEGMENTS)
+{
+ // Execute on all segments in parallel
+}
+else if (table->exec_location == FTEXECLOCATION_ANY)
+{
+ // Execute on coordinator or any one segment
+}
+else if (table->exec_location == FTEXECLOCATION_COORDINATOR)
+{
+ // Execute on coordinator only (default)
+}
+```
+
+### Segment identification
+
+For parallel execution (`mpp_execute 'all segments'`), each segment must determine which portion of the data it is responsible for:
+
+```c
+int segmentNumber = GpIdentity.segindex;
+int totalNumberOfSegments = getgpsegmentCount();
+```
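+
+For example, if the external data can be treated as a sequence of numbered chunks (files, partitions, or ranges), a simple round-robin split assigns each chunk to exactly one segment. A sketch, with `total_chunks` standing in for however your FDW enumerates the source:
+
+```c
+int segindex  = GpIdentity.segindex;
+int nsegments = getgpsegmentCount();
+
+for (int chunk = 0; chunk < total_chunks; chunk++)
+{
+    /* Each segment claims only the chunks that map to its index. */
+    if (chunk % nsegments != segindex)
+        continue;
+
+    /* ... scan this chunk ... */
+}
+```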
+
+### Parallel write operations
+
+Apache Cloudberry supports parallel write operations only when `mpp_execute`
is set to `'all segments'`. For other settings, write operations are initiated
through the coordinator.
+
+## Example FDW implementation
+
+Here's a basic skeleton for an FDW:
+
+```c
+#include "postgres.h"
+#include "foreign/fdwapi.h"
+#include "foreign/foreign.h"
+
+PG_MODULE_MAGIC;
+
+// Function declarations
+extern Datum my_fdw_handler(PG_FUNCTION_ARGS);
+extern Datum my_fdw_validator(PG_FUNCTION_ARGS);
+
+// FDW callback functions
+static void myGetForeignRelSize(PlannerInfo *root,
+ RelOptInfo *baserel,
+ Oid foreigntableid);
+static void myGetForeignPaths(PlannerInfo *root,
+ RelOptInfo *baserel,
+ Oid foreigntableid);
+// ... other callback functions
+
+PG_FUNCTION_INFO_V1(my_fdw_handler);
+Datum
+my_fdw_handler(PG_FUNCTION_ARGS)
+{
+ FdwRoutine *fdwroutine = makeNode(FdwRoutine);
+
+ // Required scan functions
+ fdwroutine->GetForeignRelSize = myGetForeignRelSize;
+ fdwroutine->GetForeignPaths = myGetForeignPaths;
+ fdwroutine->GetForeignPlan = myGetForeignPlan;
+ fdwroutine->BeginForeignScan = myBeginForeignScan;
+ fdwroutine->IterateForeignScan = myIterateForeignScan;
+ fdwroutine->ReScanForeignScan = myReScanForeignScan;
+ fdwroutine->EndForeignScan = myEndForeignScan;
+
+ // Optional write functions
+ fdwroutine->ExecForeignInsert = myExecForeignInsert;
+ fdwroutine->ExecForeignUpdate = myExecForeignUpdate;
+ fdwroutine->ExecForeignDelete = myExecForeignDelete;
+
+ PG_RETURN_POINTER(fdwroutine);
+}
+
+PG_FUNCTION_INFO_V1(my_fdw_validator);
+Datum
+my_fdw_validator(PG_FUNCTION_ARGS)
+{
+ List *options_list = untransformRelOptions(PG_GETARG_DATUM(0));
+ Oid catalog = PG_GETARG_OID(1);
+
+ // Validate options based on catalog type
+ // Implementation depends on your FDW's specific options
+
+ PG_RETURN_VOID();
+}
+```
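+
+The validation logic itself typically walks the options list and rejects anything unexpected. A minimal sketch, assuming a single hypothetical `table_name` option that is only allowed on foreign tables:
+
+```c
+#include "catalog/pg_foreign_table.h"   /* ForeignTableRelationId */
+#include "commands/defrem.h"            /* defGetString() */
+
+static void
+my_validate_options(List *options_list, Oid catalog)
+{
+    ListCell *cell;
+
+    foreach(cell, options_list)
+    {
+        DefElem *def = (DefElem *) lfirst(cell);
+
+        if (strcmp(def->defname, "table_name") == 0)
+        {
+            if (catalog != ForeignTableRelationId)
+                ereport(ERROR,
+                        (errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
+                         errmsg("\"%s\" is only valid for foreign tables",
+                                def->defname)));
+            /* defGetString() errors out if the value is not a string. */
+            (void) defGetString(def);
+        }
+        else
+            ereport(ERROR,
+                    (errcode(ERRCODE_FDW_INVALID_OPTION_NAME),
+                     errmsg("invalid option \"%s\"", def->defname)));
+    }
+}
+```
+
+The `my_fdw_validator()` entry point above would call a helper like this after `untransformRelOptions()` has turned the `text[]` argument into a `List` of `DefElem` nodes.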
+
+## Building and packaging
+
+### Using PGXS
+
+Create a Makefile that uses PGXS, the PostgreSQL extension building infrastructure:
+
+```makefile
+MODULE_big = my_fdw
+OBJS = my_fdw.o
+
+PG_CONFIG = pg_config
+PGXS := $(shell $(PG_CONFIG) --pgxs)
+
+PG_CPPFLAGS = -I$(shell $(PG_CONFIG) --includedir)
+SHLIB_LINK = -L$(shell $(PG_CONFIG) --libdir)
+include $(PGXS)
+```
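+
+With this Makefile in place, running `make` and then `make install` builds and installs the shared library, provided the Apache Cloudberry `pg_config` is found first on your `PATH` (or is passed explicitly, for example `make PG_CONFIG=$GPHOME/bin/pg_config`).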
+
+### Creating an extension
+
+Create the SQL script file (`my_fdw--1.0.sql`):
+
+```sql
+CREATE FUNCTION my_fdw_handler()
+ RETURNS fdw_handler
+ AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+CREATE FUNCTION my_fdw_validator(text[], oid)
+ RETURNS void
+ AS 'MODULE_PATHNAME'
+LANGUAGE C STRICT;
+
+CREATE FOREIGN DATA WRAPPER my_fdw
+ HANDLER my_fdw_handler
+ VALIDATOR my_fdw_validator;
+```
+
+Create the control file (`my_fdw.control`):
+
+```ini
+# my_fdw extension
+comment = 'My custom foreign data wrapper'
+default_version = '1.0'
+module_pathname = '$libdir/my_fdw'
+relocatable = true
+```
+
+Then add the extension metadata to the Makefile:
+
+```makefile
+EXTENSION = my_fdw
+DATA = my_fdw--1.0.sql
+```
+
+## Deployment
+
+For Apache Cloudberry clusters:
+
+1. Install the shared library to `$GPHOME/lib/postgresql/` on every host.
+2. Install the extension files (the control file and SQL script) to `$GPHOME/share/postgresql/extension/` on every host.
+3. Set permissions so that the `gpadmin` user can read all of the installed files.
+4. Verify that the files are present on the coordinator and on every segment host; the FDW must be installed cluster-wide.
+
+## Testing your FDW
+
+### Basic functionality test
+
+```sql
+-- Create extension
+CREATE EXTENSION my_fdw;
+
+-- Create foreign server
+CREATE SERVER my_server
+ FOREIGN DATA WRAPPER my_fdw
+ OPTIONS (host 'remote-host', port '5432');
+
+-- Create user mapping
+CREATE USER MAPPING FOR current_user
+ SERVER my_server
+ OPTIONS (user 'remote_user', password 'remote_password');
+
+-- Create foreign table
+CREATE FOREIGN TABLE my_foreign_table (
+ id integer,
+ name text
+) SERVER my_server
+OPTIONS (table_name 'remote_table');
+
+-- Test query
+SELECT * FROM my_foreign_table LIMIT 5;
+```
+
+### Parallel execution test
+
+```sql
+-- Test parallel execution
+CREATE FOREIGN TABLE parallel_test (
+ id integer,
+ data text
+) SERVER my_server
+OPTIONS (
+ table_name 'large_table',
+ mpp_execute 'all segments'
+);
+
+SELECT count(*) FROM parallel_test;
+```
+
+## Best practices
+
+- Implement comprehensive error handling throughout your FDW code, ensuring
that all failure modes are properly reported to users with clear, actionable
error messages.
+- Use PostgreSQL memory contexts correctly to prevent memory leaks and ensure
proper cleanup when queries are cancelled or encounter errors.
+- Design your FDW to handle bulk data operations efficiently and take
advantage of Apache Cloudberry's parallel processing capabilities where
appropriate.
+- Validate all user inputs and configuration parameters, implement secure
authentication mechanisms, and follow security best practices for external data
access.
+- Create thorough documentation that explains all configuration options,
connection parameters, and usage examples for administrators and developers.
+- Develop comprehensive test suites that cover various data types, edge cases,
error conditions, and performance scenarios across different external data
sources.
+
+## Debugging
+
+### Enable verbose logging
+
+Set the following in `postgresql.conf`:
+
+```ini
+log_min_messages = DEBUG1
+```
+
+### Use elog for debugging
+
+```c
+elog(DEBUG1, "FDW: Processing %d rows", row_count);
+elog(WARNING, "FDW: Connection failed, retrying...");
+```
+
+### Common issues
+
+1. Segmentation faults frequently occur due to improper memory management,
such as accessing freed memory or buffer overruns in C code.
+2. Symbol name conflicts can arise when multiple extensions define functions
with the same names; ensure all function names are unique and properly
namespaced.
+3. Parallel execution issues are often difficult to debug; start testing with
single-segment configurations before enabling multi-segment parallel processing.
+4. Performance bottlenecks may appear in data conversion or network
operations; use profiling tools to identify and optimize critical code paths.
+
+## Learn more
+
+- [Foreign Data Wrapper Helper
Functions](https://www.postgresql.org/docs/12/fdw-helpers.html)
+- [Foreign Data Wrapper Callback
Routines](https://www.postgresql.org/docs/12/fdw-callbacks.html)
diff --git a/versioned_sidebars/version-2.x-sidebars.json
b/versioned_sidebars/version-2.x-sidebars.json
index 012a2fceda..caab80abc4 100644
--- a/versioned_sidebars/version-2.x-sidebars.json
+++ b/versioned_sidebars/version-2.x-sidebars.json
@@ -48,16 +48,26 @@
"items": [
{
"type": "category",
- "label": "Load Data from Local Files",
+ "label": "From Local or Network Files",
"items": [
"data-loading/load-data-using-copy",
"data-loading/load-data-using-gpfdist",
+ "data-loading/load-data-using-gpfdists",
"data-loading/load-data-using-file-protocol",
"data-loading/load-data-using-gpload"
]
},
+ {
+ "type": "category",
+ "label": "From Cloud or Big Data Systems",
+ "items": [
+ "data-loading/load-data-from-s3",
+ "data-loading/load-data-using-pxf"
+ ]
+ },
"data-loading/load-data-from-web-services",
- "data-loading/load-data-from-kafka-using-fdw"
+ "data-loading/load-data-from-kafka-using-fdw",
+ "data-loading/handle-data-errors"
]
},
{
@@ -201,7 +211,8 @@
"type": "category",
"label": "Developer",
"items": [
- "developer/develop-extensions-using-rust"
+ "developer/develop-extensions-using-rust",
+ "developer/write-a-foreign-data-wrapper"
]
},
{
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]