This is an automated email from the ASF dual-hosted git repository.
djwang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/cloudberry-pxf.git
The following commit(s) were added to refs/heads/main by this push:
new 0d605aaa docs - pxf works with ORC using Foreign Data Wrapper
0d605aaa is described below
commit 0d605aaaf11fe7551a8fe69aa55c1155cd2245b7
Author: Nikolay Antonov <[email protected]>
AuthorDate: Fri Feb 6 11:57:12 2026 +0500
docs - pxf works with ORC using Foreign Data Wrapper
---
docs/content/hdfs_parquet.html.md.erb | 83 ++++++++++++++++++++++++-------
docs/content/objstore_parquet.html.md.erb | 58 +++++++++++++++++----
2 files changed, 114 insertions(+), 27 deletions(-)
diff --git a/docs/content/hdfs_parquet.html.md.erb
b/docs/content/hdfs_parquet.html.md.erb
index 9ad05b78..2646f00e 100644
--- a/docs/content/hdfs_parquet.html.md.erb
+++ b/docs/content/hdfs_parquet.html.md.erb
@@ -35,7 +35,7 @@ Ensure that you have met the PXF Hadoop
[Prerequisites](access_hdfs.html#hadoop_
## <a id="datatype_map"></a>Data Type Mapping
-To read and write Parquet primitive data types in Greenplum Database, map
Parquet data values to Greenplum Database columns of the same type.
+To read and write Parquet primitive data types in Apache Cloudberry, map
Parquet data values to Apache Cloudberry columns of the same type.
Parquet supports a small set of primitive data types, and uses metadata
annotations to extend the data types that it supports. These annotations
specify how to interpret the primitive type. For example, Parquet stores both
`INTEGER` and `DATE` types as the `INT32` primitive type. An annotation
identifies the original type as a `DATE`.
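For illustration, a minimal sketch of how this surfaces at the table level (the table name and HDFS path below are hypothetical, not part of this commit): a column declared with the Cloudberry `date` type is exchanged with Parquet as an annotated `INT32`, while a plain `int` column is an unannotated `INT32`.

``` sql
-- Hypothetical example: both columns are stored as the Parquet INT32 physical
-- type; 'sale_date' carries a DATE logical annotation, 'qty' does not.
CREATE WRITABLE EXTERNAL TABLE sales_by_day (sale_date date, qty int)
  LOCATION ('pxf://data/pxf_examples/sales_by_day?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```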
@@ -45,7 +45,7 @@ Parquet supports a small set of primitive data types, and
uses metadata annotati
PXF uses the following data type mapping when reading Parquet data:
-| Parquet Physical Type | Parquet Logical Type | PXF/Greenplum Data Type |
+| Parquet Physical Type | Parquet Logical Type | PXF/Cloudberry Data Type |
|-------------------|---------------|--------------------------|
| boolean | -- | Boolean |
| binary \(byte\_array\) | -- | Bytea |
@@ -67,7 +67,7 @@ PXF uses the following data type mapping when reading Parquet
data:
PXF can read a Parquet `LIST` nested type when it represents a one-dimensional
array of certain Parquet types. The supported mappings follow:
-| Parquet Data Type | PXF/Greenplum Data Type |
+| Parquet Data Type | PXF/Cloudberry Data Type |
|-------------------|-------------------------|
| list of \<boolean> | Boolean[] |
| list of \<binary> | Bytea[] |
@@ -90,7 +90,7 @@ PXF can read a Parquet `LIST` nested type when it represents
a one-dimensional a
PXF uses the following data type mapping when writing Parquet data:
-| PXF/Greenplum Data Type | Parquet Physical Type | Parquet Logical Type |
+| PXF/Cloudberry Data Type | Parquet Physical Type | Parquet Logical Type |
|-------------------|---------------|--------------------------|
| Bigint | int64 | -- |
| Boolean | boolean | -- |
@@ -114,7 +114,7 @@ PXF uses the following data type mapping when writing
Parquet data:
PXF can write a one-dimensional `LIST` of certain Parquet data types. The
supported mappings follow:
-| PXF/Greenplum Data Type | Parquet Data Type |
+| PXF/Cloudberry Data Type | Parquet Data Type |
|-------------------|--------------------------|
| Bigint[] | list of \<int64> |
| Boolean[] | list of \<boolean> |
@@ -149,7 +149,7 @@ When you provide the Parquet schema file to PXF, you must
specify the absolute p
The PXF HDFS connector `hdfs:parquet` profile supports reading and writing
HDFS data in Parquet-format. When you insert records into a writable external
table, the block(s) of data that you insert are written to one or more files in
the directory that you specified.
-Use the following syntax to create a Greenplum Database external table that
references an HDFS directory:
+Use the following syntax to create an Apache Cloudberry external table that
references an HDFS directory:
``` sql
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
@@ -160,7 +160,7 @@ FORMAT 'CUSTOM'
(FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
```
-The specific keywords and values used in the Greenplum Database [CREATE
EXTERNAL
TABLE](https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/ref_guide-sql_commands-CREATE_EXTERNAL_TABLE.html)
command are described in the table below.
+The specific keywords and values used in the Apache Cloudberry [CREATE
EXTERNAL
TABLE](https://cloudberry.apache.org/docs/sql-stmts/create-external-table/)
command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
@@ -169,10 +169,36 @@ The specific keywords and values used in the Greenplum
Database [CREATE EXTERNAL
| SERVER=\<server_name\> | The named server configuration that PXF uses to
access the data. PXF uses the `default` server if not specified. |
| \<custom‑option\> | \<custom-option\>s are described below.|
| FORMAT 'CUSTOM' | Use `FORMAT` '`CUSTOM`' with
`(FORMATTER='pxfwritable_export')` (write) or
`(FORMATTER='pxfwritable_import')` (read). |
-| DISTRIBUTED BY | If you want to load data from an existing Greenplum
Database table into the writable external table, consider specifying the same
distribution policy or `<column_name>` on both tables. Doing so will avoid
extra motion of data between segments on the load operation. |
+| DISTRIBUTED BY | If you want to load data from an existing Apache
Cloudberry table into the writable external table, consider specifying the same
distribution policy or `<column_name>` on both tables. Doing so will avoid
extra motion of data between segments on the load operation. |
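A minimal sketch of the DISTRIBUTED BY guidance in the row above (the source table, column, and path are hypothetical): declaring the same distribution key on the source table and on the writable external table lets the load proceed without redistributing rows between segments.

``` sql
-- Hypothetical source table, distributed by 'location'.
CREATE TABLE sales_source (location text, total_sales double precision)
  DISTRIBUTED BY (location);

-- Writable external table declared with the same distribution key, so the
-- INSERT ... SELECT below avoids extra motion between segments.
CREATE WRITABLE EXTERNAL TABLE sales_ext (location text, total_sales double precision)
  LOCATION ('pxf://data/pxf_examples/sales?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export')
DISTRIBUTED BY (location);

INSERT INTO sales_ext SELECT * FROM sales_source;
```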
+
+
+## <a id="profile_cfdw"></a>Creating the Foreign Table
+
+The PXF HDFS `hdfs_pxf_fdw` foreign data wrapper supports reading and writing
Parquet-formatted HDFS files. When you insert records into a foreign table, the
block(s) of data that you insert are written to one file per segment in the
directory that you specified in the `resource` clause.
+
+Use the following syntax to create an Apache Cloudberry foreign table that
references an HDFS file or directory:
+
+``` sql
+CREATE SERVER <foreign_server> FOREIGN DATA WRAPPER hdfs_pxf_fdw;
+CREATE USER MAPPING FOR <user_name> SERVER <foreign_server>;
+
+CREATE FOREIGN TABLE [ IF NOT EXISTS ] <table_name>
+ ( <column_name> <data_type> [, ...] | LIKE <other_table> )
+ SERVER <foreign_server>
+  OPTIONS ( resource '<path-to-hdfs-file>', format 'parquet' [, <custom-option> '<value>' [, ...] ]);
+```
+
+The specific keywords and values used in the Apache Cloudberry [CREATE FOREIGN
TABLE](https://cloudberry.apache.org/docs/sql-stmts/create-foreign-table)
command are described below.
+
+| Keyword | Value |
+|-------|-------------------------------------|
+| \<foreign_server\> | The named server configuration that PXF uses to access the data. You can override credentials in the `CREATE SERVER` statement as described in [Overriding the S3 Server Configuration for Foreign Tables](access_s3.html#s3_override_fdw). |
+| \<path‑to‑hdfs‑file\> | The path to the directory in
the HDFS data store. When the `<server_name>` configuration includes a
[`pxf.fs.basePath`](cfg_server.html#pxf-fs-basepath) property setting, PXF
considers \<path‑to‑hdfs‑file\> to be relative to the base
path specified. Otherwise, PXF considers it to be an absolute path.
\<path‑to‑hdfs‑file\> must not specify a relative path nor
include the dollar sign (`$`) character. |
+| format | The file format; specify `'parquet'` for Parquet-formatted data.
|
+| \<custom-option\> | \<custom-option\>s are described below. |
<a id="customopts"></a>
-The PXF `hdfs:parquet` profile supports the following read option. You specify
this option in the `CREATE EXTERNAL TABLE` `LOCATION` clause:
+The PXF `hdfs:parquet` profile supports the following read option:
| Read Option | Value Description |
|-------|-------------------------------------|
@@ -188,7 +214,7 @@ The PXF `hdfs:parquet` profile supports encoding- and
compression-related write
| ENABLE\_DICTIONARY | A boolean value that specifies whether or not to enable
dictionary encoding. The default value is `true`; dictionary encoding is
enabled when PXF writes Parquet files. |
| DICTIONARY\_PAGE\_SIZE | When dictionary encoding is enabled, there is a
single dictionary page per column, per row group. `DICTIONARY_PAGE_SIZE` is
similar to `PAGE_SIZE`, but for the dictionary. The default dictionary page
size is `1 * 1024 * 1024` bytes. |
| PARQUET_VERSION | The Parquet version; PXF supports the values `v1` and `v2`
for this option. The default Parquet version is `v1`. |
-| SCHEMA | The absolute path to the Parquet schema file on the Greenplum host
or on HDFS. |
+| SCHEMA | The absolute path to the Parquet schema file on the Cloudberry PXF
host or on HDFS. |
**Note**: You must explicitly specify `uncompressed` if you do not want PXF to
compress the data.
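As a sketch of how the write options above are passed, assuming the usual PXF pattern of appending custom options to the `LOCATION` URI (the table name and path are hypothetical); for a foreign table the same options would instead go in the `OPTIONS` clause shown earlier:

``` sql
-- Hypothetical table: write Parquet v2 files with dictionary encoding disabled.
CREATE WRITABLE EXTERNAL TABLE pxf_parquet_v2 (location text, total_sales double precision)
  LOCATION ('pxf://data/pxf_examples/pxf_parquet_v2?PROFILE=hdfs:parquet&PARQUET_VERSION=v2&ENABLE_DICTIONARY=false')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```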
@@ -208,12 +234,29 @@ This example utilizes the data schema introduced in
[Example: Reading Text Data
In this example, you create a Parquet-format writable external table that uses
the default PXF server to reference Parquet-format data in HDFS, insert some
data into the table, and then create a readable external table to read the data.
-1. Use the `hdfs:parquet` profile to create a writable external table. For
example:
+1. Apache Cloudberry does not support both reading from and writing to a single
external table. Create two tables, one for writing and one for reading, that
reference the same HDFS directory:
``` sql
postgres=# CREATE WRITABLE EXTERNAL TABLE pxf_tbl_parquet (location text,
month text, number_of_orders int, item_quantity_per_order int[], total_sales
double precision)
LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
+
+ postgres=# CREATE EXTERNAL TABLE read_pxf_parquet(location text, month
text, number_of_orders int, item_quantity_per_order int[], total_sales double
precision)
+ LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
+ FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+ ```
+
+    Or, to use a single table for both read and write operations, create a foreign table instead:
+
+    ``` sql
+    postgres=# CREATE SERVER example_parquet FOREIGN DATA WRAPPER hdfs_pxf_fdw;
+    postgres=# CREATE USER MAPPING FOR CURRENT_USER SERVER example_parquet;
+    postgres=# CREATE FOREIGN TABLE pxf_tbl_parquet(location text, month text, number_of_orders int, item_quantity_per_order int[], total_sales double precision)
+ SERVER example_parquet
+ OPTIONS (
+ resource 'data/pxf_examples/pxf_parquet',
+ format 'parquet'
+ );
```
2. Write a few records to the `pxf_parquet` HDFS directory by inserting
directly into the `pxf_tbl_parquet` table. For example:
@@ -223,20 +266,24 @@ In this example, you create a Parquet-format writable
external table that uses t
postgres=# INSERT INTO pxf_tbl_parquet VALUES ( 'Cleveland', 'Oct', 2,
'{3333,7777}', 96645.37 );
```
-3. Recall that Greenplum Database does not support directly querying a
writable external table. To read the data in `pxf_parquet`, create a readable
external Greenplum Database referencing this HDFS directory:
+3. Query the readable external table `read_pxf_parquet`:
``` sql
- postgres=# CREATE EXTERNAL TABLE read_pxf_parquet(location text, month
text, number_of_orders int, item_quantity_per_order int[], total_sales double
precision)
- LOCATION ('pxf://data/pxf_examples/pxf_parquet?PROFILE=hdfs:parquet')
- FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
+ postgres=# SELECT * FROM read_pxf_parquet ORDER BY total_sales;
+ ```
+ ``` pre
+    location  | month | number_of_orders | item_quantity_per_order | total_sales
+   -----------+-------+------------------+-------------------------+-------------
+    Frankfurt | Mar   |              777 | {1,11,111}              |     3956.98
+    Cleveland | Oct   |             3812 | {3333,7777}             |     96645.4
+   (2 rows)
```
-4. Query the readable external table `read_pxf_parquet`:
+    Or, if you created the foreign table instead, query `pxf_tbl_parquet`:
``` sql
- postgres=# SELECT * FROM read_pxf_parquet ORDER BY total_sales;
+ postgres=# SELECT * FROM pxf_tbl_parquet ORDER BY total_sales;
```
-
``` pre
 location  | month | number_of_orders | item_quantity_per_order | total_sales
-----------+-------+------------------+-------------------------+-------------
diff --git a/docs/content/objstore_parquet.html.md.erb
b/docs/content/objstore_parquet.html.md.erb
index e0c1f1cb..50cc26a2 100644
--- a/docs/content/objstore_parquet.html.md.erb
+++ b/docs/content/objstore_parquet.html.md.erb
@@ -32,7 +32,7 @@ Ensure that you have met the PXF Object Store
[Prerequisites](access_objstore.ht
## <a id="datatype_map"></a>Data Type Mapping
-Refer to [Data Type Mapping](hdfs_parquet.html#datatype_map) in the PXF HDFS
Parquet documentation for a description of the mapping between Greenplum
Database and Parquet data types.
+Refer to [Data Type Mapping](hdfs_parquet.html#datatype_map) in the PXF HDFS
Parquet documentation for a description of the mapping between Apache
Cloudberry and Parquet data types.
## <a id="profile_cet"></a>Creating the External Table
@@ -47,7 +47,7 @@ The PXF `<objstore>:parquet` profiles support reading and
writing data in Parque
| S3 | s3 |
-Use the following syntax to create a Greenplum Database external table that
references an HDFS directory. When you insert records into a writable external
table, the block(s) of data that you insert are written to one or more files in
the directory that you specified.
+Use the following syntax to create an Apache Cloudberry external table that
references a directory in an object store. When you insert records into a
writable external table, the block(s) of data that you insert are written to
one or more files in the directory that you specified.
``` sql
CREATE [WRITABLE] EXTERNAL TABLE <table_name>
@@ -58,7 +58,7 @@ FORMAT 'CUSTOM'
(FORMATTER='pxfwritable_import'|'pxfwritable_export')
[DISTRIBUTED BY (<column_name> [, ... ] ) | DISTRIBUTED RANDOMLY];
```
-The specific keywords and values used in the Greenplum Database [CREATE
EXTERNAL
TABLE](https://docs.vmware.com/en/VMware-Greenplum/6/greenplum-database/ref_guide-sql_commands-CREATE_EXTERNAL_TABLE.html)
command are described in the table below.
+The specific keywords and values used in the Apache Cloudberry [CREATE
EXTERNAL
TABLE](https://cloudberry.apache.org/docs/sql-stmts/create-external-table/)
command are described in the table below.
| Keyword | Value |
|-------|-------------------------------------|
@@ -67,30 +67,70 @@ The specific keywords and values used in the Greenplum
Database [CREATE EXTERNAL
| SERVER=\<server_name\> | The named server configuration that PXF uses to
access the data. |
| \<custom‑option\>=\<value\> | Parquet-specific custom options are
described in the [PXF HDFS Parquet
documentation](hdfs_parquet.html#customopts). |
| FORMAT 'CUSTOM' | Use `FORMAT` '`CUSTOM`' with
`(FORMATTER='pxfwritable_export')` (write) or
`(FORMATTER='pxfwritable_import')` (read). |
-| DISTRIBUTED BY | If you want to load data from an existing Greenplum
Database table into the writable external table, consider specifying the same
distribution policy or `<column_name>` on both tables. Doing so will avoid
extra motion of data between segments on the load operation. |
+| DISTRIBUTED BY | If you want to load data from an existing Apache
Cloudberry table into the writable external table, consider specifying the same
distribution policy or `<column_name>` on both tables. Doing so will avoid
extra motion of data between segments on the load operation. |
If you are accessing an S3 object store:
- You can provide S3 credentials via custom options in the `CREATE EXTERNAL
TABLE` command as described in [Overriding the S3 Server Configuration for
External Tables DDL](access_s3.html#s3_override_ext).
- If you are reading Parquet data from S3, you can direct PXF to use the S3
Select Amazon service to retrieve the data. Refer to [Using the Amazon S3
Select Service](access_s3.html#s3_select) for more information about the PXF
custom option used for this purpose.
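A sketch of the first point, assuming the `accesskey` and `secretkey` custom options documented in the linked S3 topic (the bucket, server name, and key values are placeholders):

``` sql
-- Hypothetical example: credentials supplied in the LOCATION clause override
-- those in the s3srvcfg server configuration.
CREATE WRITABLE EXTERNAL TABLE pxf_parquet_s3_creds (location text, total_sales double precision)
  LOCATION ('pxf://BUCKET/pxf_examples/pxf_parquet?PROFILE=s3:parquet&SERVER=s3srvcfg&accesskey=YOURKEY&secretkey=YOURSECRET')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
```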
+## <a id="profile_cfdw"></a>Creating the Foreign Table
+
+Use one of the following foreign data wrappers with the `format 'parquet'` option:
+
+| Object Store | Foreign Data Wrapper |
+|-------|-------------------------------------|
+| Azure Blob Storage | wasbs_pxf_fdw |
+| Azure Data Lake Storage Gen2 | abfss_pxf_fdw |
+| Google Cloud Storage | gs_pxf_fdw |
+| MinIO | s3_pxf_fdw |
+| S3 | s3_pxf_fdw |
+
+The following syntax creates an Apache Cloudberry foreign table that references
a Parquet-format file:
+
+``` sql
+CREATE SERVER <foreign_server> FOREIGN DATA WRAPPER <store>_pxf_fdw;
+CREATE USER MAPPING FOR <user_name> SERVER <foreign_server>;
+
+CREATE FOREIGN TABLE [ IF NOT EXISTS ] <table_name>
+ ( <column_name> <data_type> [, ...] | LIKE <other_table> )
+ SERVER <foreign_server>
+ OPTIONS ( resource '<path-to-file>', format 'parquet' [, <custom-option>
'<value>' [, ...] ]);
+```
+
+| Keyword | Value |
+|-------|-------------------------------------|
+| \<foreign_server\> | The named server configuration that PXF uses to access the data. You can override credentials in the `CREATE SERVER` statement as described in [Overriding the S3 Server Configuration for Foreign Tables](access_s3.html#s3_override_fdw). |
+| resource \<path‑to‑file\> | The path to the directory or file
in the object store. When the `<server_name>` configuration includes a
[`pxf.fs.basePath`](cfg_server.html#pxf-fs-basepath) property setting, PXF
considers \<path‑to‑file\> to be relative to the base path
specified. Otherwise, PXF considers it to be an absolute path.
\<path‑to‑file\> must not specify a relative path nor include the
dollar sign (`$`) character. |
+| format 'parquet' | The file format; specify `'parquet'` for
Parquet-formatted data. |
+| \<custom‑option\>=\<value\> | Parquet-specific custom options are described in the [PXF HDFS Parquet documentation](hdfs_parquet.html#customopts). |
+
+
## <a id="example"></a> Example
Refer to the [Example](hdfs_parquet.html#parquet_write) in the PXF HDFS
Parquet documentation for a Parquet write/read example. Modifications that you
must make to run the example with an object store include:
-- Using the `CREATE WRITABLE EXTERNAL TABLE` syntax and `LOCATION` keywords
and settings described above for the writable external table. For example, if
your server name is `s3srvcfg`:
+- Using the `CREATE WRITABLE EXTERNAL TABLE` syntax and `LOCATION` keywords
and settings described above for the writable and readable external tables. For
example, if your server name is `s3srvcfg`:
``` sql
CREATE WRITABLE EXTERNAL TABLE pxf_tbl_parquet_s3 (location text, month
text, number_of_orders int, item_quantity_per_order int[], total_sales double
precision)
LOCATION
('pxf://BUCKET/pxf_examples/pxf_parquet?PROFILE=s3:parquet&SERVER=s3srvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_export');
- ```
-
-- Using the `CREATE EXTERNAL TABLE` syntax and `LOCATION` keywords and
settings described above for the readable external table. For example, if your
server name is `s3srvcfg`:
- ``` sql
CREATE EXTERNAL TABLE read_pxf_parquet_s3(location text, month text,
number_of_orders int, item_quantity_per_order int[], total_sales double
precision)
LOCATION
('pxf://BUCKET/pxf_examples/pxf_parquet?PROFILE=s3:parquet&SERVER=s3srvcfg')
FORMAT 'CUSTOM' (FORMATTER='pxfwritable_import');
```
+- Using the `CREATE FOREIGN TABLE` syntax and settings described above for the
foreign table. For example, if your server name is `s3srvcfg`:
+ ``` sql
+ CREATE SERVER s3srvcfg FOREIGN DATA WRAPPER s3_pxf_fdw;
+ CREATE USER MAPPING FOR CURRENT_USER SERVER s3srvcfg;
+
+ CREATE FOREIGN TABLE pxf_parquet_s3 (location text, month text,
number_of_orders int, item_quantity_per_order int[], total_sales double
precision)
+ SERVER s3srvcfg
+ OPTIONS (
+ resource 'BUCKET/pxf_examples/pxf_parquet',
+ format 'parquet'
+    );
+ ```
\ No newline at end of file
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]