This is an automated email from the ASF dual-hosted git repository.

djwang pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/cloudberry-pxf.git


The following commit(s) were added to refs/heads/main by this push:
     new 21da190b docs - pxf_fdw works with HDFS data sources
21da190b is described below

commit 21da190b8b2614251d17f1f321baca8dfea57329
Author: Nikolay Antonov <[email protected]>
AuthorDate: Mon Feb 16 22:19:01 2026 +0500

    docs - pxf_fdw works with HDFS data sources
---
 docs/content/access_hdfs.html.md.erb | 73 ++++++++++++++++++++++--------------
 docs/content/hive_pxf.html.md.erb    |  2 +-
 2 files changed, 46 insertions(+), 29 deletions(-)

diff --git a/docs/content/access_hdfs.html.md.erb 
b/docs/content/access_hdfs.html.md.erb
index fd87c709..e5a970ec 100644
--- a/docs/content/access_hdfs.html.md.erb
+++ b/docs/content/access_hdfs.html.md.erb
@@ -25,7 +25,7 @@ PXF is compatible with Cloudera, Hortonworks Data Platform, 
and generic Apache H
 
 ## <a id="hdfs_arch"></a>Architecture
 
-HDFS is the primary distributed storage mechanism used by Apache Hadoop. When 
a user or application performs a query on a PXF external table that references 
an HDFS file, the Greenplum Database coordinator host dispatches the query to 
all segment instances. Each segment instance contacts the PXF Service running 
on its host. When it receives the request from a segment instance, the PXF 
Service:
+HDFS is the primary distributed storage mechanism used by Apache Hadoop. When 
a user or application performs a query on a PXF external table that references 
an HDFS file, the Apache Cloudberry coordinator host dispatches the query to 
all segment instances. Each segment instance contacts the PXF Service running 
on its host. When it receives the request from a segment instance, the PXF 
Service:
 
 1. Allocates a worker thread to serve the request from the segment instance.
 2. Invokes the HDFS Java API to request metadata information for the HDFS file 
from the HDFS NameNode. 
@@ -34,25 +34,25 @@ HDFS is the primary distributed storage mechanism used by 
Apache Hadoop. When a
 
 ![Greenplum Platform Extension Framework to Hadoop Architecture](graphics/pxfarch.png "Apache Cloudberry Platform Extension Framework-to-Hadoop Architecture")
 
-A PXF worker thread works on behalf of a segment instance. A worker thread 
uses its Greenplum Database `gp_segment_id` and the file block information 
described in the metadata to assign itself a specific portion of the query 
data. This data may reside on one or more HDFS DataNodes.
+A PXF worker thread works on behalf of a segment instance. A worker thread 
uses its Apache Cloudberry `gp_segment_id` and the file block information 
described in the metadata to assign itself a specific portion of the query 
data. This data may reside on one or more HDFS DataNodes.
 
-The PXF worker thread invokes the HDFS Java API to read the data and delivers 
it to the segment instance. The segment instance delivers its portion of the 
data to the Greenplum Database coordinator host. This communication occurs 
across segment hosts and segment instances in parallel.
+The PXF worker thread invokes the HDFS Java API to read the data and delivers 
it to the segment instance. The segment instance delivers its portion of the 
data to the Apache Cloudberry coordinator host. This communication occurs 
across segment hosts and segment instances in parallel.
 
 
 ## <a id="hadoop_prereq"></a>Prerequisites
 
 Before working with Hadoop data using PXF, ensure that:
 
-- You have configured PXF, and PXF is running on each Greenplum Database host. 
See [Configuring PXF](instcfg_pxf.html) for additional information.
+- You have configured PXF, and PXF is running on each Apache Cloudberry host. 
See [Configuring PXF](instcfg_pxf.html) for additional information.
 - You have configured the PXF Hadoop Connectors that you plan to use. Refer to 
[Configuring PXF Hadoop Connectors](client_instcfg.html) for instructions. If 
you plan to access JSON-formatted data stored in a Cloudera Hadoop cluster, PXF 
requires a Cloudera version 5.8 or later Hadoop distribution.
-- If user impersonation is enabled (the default), ensure that you have granted 
read (and write as appropriate) permission to the HDFS files and directories 
that will be accessed as external tables in Greenplum Database to each 
Greenplum Database user/role name that will access the HDFS files and 
directories. If user impersonation is not enabled, you must grant this 
permission to the `gpadmin` user.
-- Time is synchronized between the Greenplum Database hosts and the external 
Hadoop systems.
+- If user impersonation is enabled (the default), ensure that you have granted 
read (and write as appropriate) permission to the HDFS files and directories 
that will be accessed as external tables in Apache Cloudberry to each Apache 
Cloudberry user/role name that will access the HDFS files and directories. If 
user impersonation is not enabled, you must grant this permission to the 
`gpadmin` user.
+- Time is synchronized between the Apache Cloudberry hosts and the external 
Hadoop systems.
 
 
 ## <a id="hdfs_cmdline"></a>HDFS Shell Command Primer
 Examples in the PXF Hadoop topics access files on HDFS. You can choose to 
access files that already exist in your HDFS cluster. Or, you can follow the 
steps in the examples to create new files.
 
-A Hadoop installation includes command-line tools that interact directly with 
your HDFS file system. These tools support typical file system operations that 
include copying and listing files, changing file permissions, and so forth. You 
run these tools on a system with a Hadoop client installation. By default, 
Greenplum Database hosts do not
+A Hadoop installation includes command-line tools that interact directly with 
your HDFS file system. These tools support typical file system operations that 
include copying and listing files, changing file permissions, and so forth. You 
run these tools on a system with a Hadoop client installation. By default, 
Apache Cloudberry hosts do not
 include a Hadoop client installation.
 
 The HDFS file system command syntax is `hdfs dfs <options> [<file>]`. Invoked 
with no options, `hdfs dfs` lists the file system options supported by the tool.
@@ -103,26 +103,26 @@ The PXF Hadoop connectors provide built-in profiles to 
support the following dat
 
 The PXF Hadoop connectors expose the following profiles to read, and in many 
cases write, these supported data formats:
 
-| Data Source | Data Format | Profile Name(s) | Deprecated Profile Name | 
Supported Operations |
+| Data Source | Data Format | Profile Name(s) | Foreign Data Wrapper format | 
Supported Operations |
 |-------------|------|---------|-----|-----|
-| HDFS | delimited single line [text](hdfs_text.html#profile_text) | hdfs:text 
| n/a | Read, Write |
-| HDFS | delimited single line comma-separated values of 
[text](hdfs_text.html#profile_text) | hdfs:csv | n/a | Read, Write |
-| HDFS | multi-byte or multi-character delimited single line 
[csv](hdfs_text.html#multibyte_delim) | hdfs:csv | n/a | Read |
-| HDFS | fixed width single line [text](hdfs_fixedwidth.html) | 
hdfs:fixedwidth | n/a | Read, Write |
-| HDFS | delimited [text with quoted 
linefeeds](hdfs_text.html#profile_textmulti) | hdfs:text:multi | n/a | Read |
-| HDFS | [Avro](hdfs_avro.html) | hdfs:avro | n/a | Read, Write |
-| HDFS | [JSON](hdfs_json.html) | hdfs:json | n/a | Read, Write |
-| HDFS | [ORC](hdfs_orc.html) | hdfs:orc | n/a | Read, Write |
-| HDFS | [Parquet](hdfs_parquet.html) | hdfs:parquet | n/a | Read, Write |
-| HDFS | AvroSequenceFile | hdfs:AvroSequenceFile | n/a | Read, Write |
-| HDFS | [SequenceFile](hdfs_seqfile.html) | hdfs:SequenceFile | n/a | Read, 
Write |
-| [Hive](hive_pxf.html) | stored as TextFile | hive, [hive:text] 
(hive_pxf.html#hive_text) | Hive, HiveText | Read |
-| [Hive](hive_pxf.html) | stored as SequenceFile | hive | Hive | Read |
-| [Hive](hive_pxf.html) | stored as RCFile | hive, 
[hive:rc](hive_pxf.html#hive_hiverc) | Hive, HiveRC | Read |
-| [Hive](hive_pxf.html) | stored as ORC | hive, 
[hive:orc](hive_pxf.html#hive_orc) | Hive, HiveORC, HiveVectorizedORC | Read |
-| [Hive](hive_pxf.html) | stored as Parquet | hive | Hive | Read |
-| [Hive](hive_pxf.html) | stored as Avro | hive | Hive | Read |
-| [HBase](hbase_pxf.html) | Any | hbase | HBase | Read |
+| HDFS | delimited single line [text](hdfs_text.html#profile_text) | hdfs:text 
| text | Read, Write |
+| HDFS | delimited single line comma-separated values of 
[text](hdfs_text.html#profile_text) | hdfs:csv | csv | Read, Write |
+| HDFS | multi-byte or multi-character delimited single line 
[csv](hdfs_text.html#multibyte_delim) | hdfs:csv | csv | Read |
+| HDFS | fixed width single line [text](hdfs_fixedwidth.html) | 
hdfs:fixedwidth | | Read, Write |
+| HDFS | delimited [text with quoted 
linefeeds](hdfs_text.html#profile_textmulti) | hdfs:text:multi | text:multi | 
Read |
+| HDFS | [Avro](hdfs_avro.html) | hdfs:avro | avro | Read, Write |
+| HDFS | [JSON](hdfs_json.html) | hdfs:json | json | Read, Write |
+| HDFS | [ORC](hdfs_orc.html) | hdfs:orc | orc | Read, Write |
+| HDFS | [Parquet](hdfs_parquet.html) | hdfs:parquet | parquet | Read, Write |
+| HDFS | AvroSequenceFile | hdfs:AvroSequenceFile | AvroSequenceFile | Read, 
Write |
+| HDFS | [SequenceFile](hdfs_seqfile.html) | hdfs:SequenceFile | SequenceFile 
| Read, Write |
+| [Hive](hive_pxf.html) | stored as TextFile | hive, 
[hive:text](hive_pxf.html#hive_text) |  | Read |
+| [Hive](hive_pxf.html) | stored as SequenceFile | hive |  | Read |
+| [Hive](hive_pxf.html) | stored as RCFile | hive, 
[hive:rc](hive_pxf.html#hive_hiverc) | | Read |
+| [Hive](hive_pxf.html) | stored as ORC | hive, 
[hive:orc](hive_pxf.html#hive_orc) | orc | Read |
+| [Hive](hive_pxf.html) | stored as Parquet | hive | | Read |
+| [Hive](hive_pxf.html) | stored as Avro | hive | | Read |
+| [HBase](hbase_pxf.html) | Any | hbase |  | Read |
 
 ### <a id="choose_profile"></a>Choosing the Profile
 
@@ -143,12 +143,29 @@ When accessing ORC-format data:
 Choose the `hdfs:parquet` profile when the file is Parquet, you know the 
location of the file in the HDFS file system, and you want to take advantage of 
extended filter pushdown support for additional data types and operators.
 
 
-### <a id="specify_profile"></a>Specifying the Profile
+### <a id="specify_profile"></a>Specifying the Profile for External Tables
 
-You must provide the profile name when you specify the `pxf` protocol in a 
`CREATE EXTERNAL TABLE` command to create a Greenplum Database external table 
that references a Hadoop file or directory, HBase table, or Hive table. For 
example, the following command creates an external table that uses the default 
server and specifies the profile named `hdfs:text` to access the HDFS file 
`/data/pxf_examples/pxf_hdfs_simple.txt`:
+You must provide the profile name when you specify the `pxf` protocol in a `CREATE EXTERNAL TABLE` command to create an Apache Cloudberry external table that references a Hadoop file or directory, HBase table, or Hive table. For example, the following command creates an external table that uses the default server and specifies the profile named `hdfs:text` to access the HDFS file `/data/pxf_examples/pxf_hdfs_simple.txt`:
 
 ``` sql
 CREATE EXTERNAL TABLE pxf_hdfs_text(location text, month text, num_orders int, 
total_sales float8)
    LOCATION ('pxf://data/pxf_examples/pxf_hdfs_simple.txt?PROFILE=hdfs:text')
 FORMAT 'TEXT' (delimiter=E',');
 ```
+
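
Once created, the external table above is queried with ordinary SQL; a hypothetical query (an editor's sketch, not part of this commit, assuming the sample data exists):

``` sql
-- Aggregate the sample orders data that PXF reads from HDFS
SELECT location, SUM(total_sales) AS sales
FROM pxf_hdfs_text
GROUP BY location;
```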
+### <a id="specify_fdw_profile"></a>Specifying the Profile for Foreign Tables
+
+When you use the `hdfs_pxf_fdw`, `hive_pxf_fdw`, or `hbase_pxf_fdw` foreign data wrapper in a `CREATE FOREIGN TABLE` command, you must specify a server name that you configured in the Prerequisites section above. The foreign table can reference a Hadoop file or directory, an HBase table, or a Hive table. For example, the following commands create a foreign server named `hadoop_server` with the `hdfs_pxf_fdw` foreign data wrapper, then create a foreign table that uses the `text` format to access th [...]
+
+``` sql
+CREATE SERVER hadoop_server FOREIGN DATA WRAPPER hdfs_pxf_fdw;
+CREATE USER MAPPING FOR CURRENT_USER SERVER hadoop_server;
+
+CREATE FOREIGN TABLE pxf_parquet_s3 (location text, month text, num_orders int, total_sales float8)
+SERVER hadoop_server
+OPTIONS (
+  resource 'data/pxf_examples/pxf_hdfs_simple.txt',
+  format 'text',
+  delimiter E','
+);
+```
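
A foreign table defined this way is then queried like any local table; a hypothetical follow-on query (an editor's sketch, not part of this commit):

``` sql
-- PXF reads the HDFS file on demand when the foreign table is scanned
SELECT month, num_orders
FROM pxf_parquet_s3
WHERE total_sales > 1000;
```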
diff --git a/docs/content/hive_pxf.html.md.erb 
b/docs/content/hive_pxf.html.md.erb
index 3884b12c..4b470c74 100644
--- a/docs/content/hive_pxf.html.md.erb
+++ b/docs/content/hive_pxf.html.md.erb
@@ -335,7 +335,7 @@ Use the `hive:rc` profile to query RCFile-formatted data in 
a Hive table.
 
 ## <a id="hive_orc"></a>Accessing ORC-Format Hive Tables
 
-The Optimized Row Columnar (ORC) file format is a columnar file format that 
provides a highly efficient way to both store and access HDFS data. ORC format 
offers improvements over text and RCFile formats in terms of both compression 
and performance. PXF supports ORC version 1.2.1.
+The Optimized Row Columnar (ORC) file format is a columnar file format that 
provides a highly efficient way to both store and access HDFS data. ORC format 
offers improvements over text and RCFile formats in terms of both compression 
and performance.
 
 ORC is type-aware and specifically designed for Hadoop workloads. ORC files 
store both the type of and encoding information for the data in the file. All 
columns within a single group of row data (also known as stripe) are stored 
together on disk in ORC format files. The columnar nature of the ORC format 
type enables read projection, helping avoid accessing unnecessary columns 
during a query.
 


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
