This is an automated email from the ASF dual-hosted git repository.
lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git
The following commit(s) were added to refs/heads/master by this push:
new 548325adef [doc] Separate PyPaimon documentations
548325adef is described below
commit 548325adef1decc5c16cdd5de3a22d65d1a73eba
Author: JingsongLi <[email protected]>
AuthorDate: Fri Jan 23 18:06:50 2026 +0800
[doc] Separate PyPaimon documentations
---
docs/content/concepts/rest/overview.md | 8 -
docs/content/maintenance/_index.md | 2 +-
docs/content/program-api/_index.md | 2 +-
docs/content/{program-api => pypaimon}/_index.md | 2 +-
docs/content/pypaimon/data-evolution.md | 80 +++++++++
docs/content/pypaimon/overview.md | 54 ++++++
.../{program-api => pypaimon}/python-api.md | 190 +--------------------
docs/content/pypaimon/pytorch.md | 60 +++++++
docs/content/pypaimon/ray-data.md | 142 +++++++++++++++
9 files changed, 345 insertions(+), 195 deletions(-)
diff --git a/docs/content/concepts/rest/overview.md b/docs/content/concepts/rest/overview.md
index 47c83d3322..dee9a9c257 100644
--- a/docs/content/concepts/rest/overview.md
+++ b/docs/content/concepts/rest/overview.md
@@ -66,11 +66,3 @@ RESTCatalog supports multiple access authentication methods, including the follo
## REST Open API
See [REST API]({{< ref "concepts/rest/rest-api" >}}).
-
-## REST Java API
-
-See [REST Java API]({{< ref "program-api/rest-api" >}}).
-
-## REST Python API
-
-See [REST Python API]({{< ref "program-api/python-api" >}}).
\ No newline at end of file
diff --git a/docs/content/maintenance/_index.md b/docs/content/maintenance/_index.md
index bfea9b72ff..197e93f0cc 100644
--- a/docs/content/maintenance/_index.md
+++ b/docs/content/maintenance/_index.md
@@ -3,7 +3,7 @@ title: Maintenance
icon: <i class="fa fa-wrench title maindish" aria-hidden="true"></i>
bold: true
bookCollapseSection: true
-weight: 95
+weight: 94
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/content/program-api/_index.md b/docs/content/program-api/_index.md
index c5ba04fc0d..0449fd8416 100644
--- a/docs/content/program-api/_index.md
+++ b/docs/content/program-api/_index.md
@@ -3,7 +3,7 @@ title: Program API
icon: <i class="fa fa-briefcase title maindish" aria-hidden="true"></i>
bold: true
bookCollapseSection: true
-weight: 96
+weight: 95
---
<!--
Licensed to the Apache Software Foundation (ASF) under one
diff --git a/docs/content/program-api/_index.md b/docs/content/pypaimon/_index.md
similarity index 97%
copy from docs/content/program-api/_index.md
copy to docs/content/pypaimon/_index.md
index c5ba04fc0d..b89939b6fc 100644
--- a/docs/content/program-api/_index.md
+++ b/docs/content/pypaimon/_index.md
@@ -1,5 +1,5 @@
---
-title: Program API
+title: PyPaimon
icon: <i class="fa fa-briefcase title maindish" aria-hidden="true"></i>
bold: true
bookCollapseSection: true
diff --git a/docs/content/pypaimon/data-evolution.md b/docs/content/pypaimon/data-evolution.md
new file mode 100644
index 0000000000..bd0bde03a4
--- /dev/null
+++ b/docs/content/pypaimon/data-evolution.md
@@ -0,0 +1,80 @@
+---
+title: "Data Evolution"
+weight: 5
+type: docs
+aliases:
+ - /pypaimon/data-evolution.html
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Data Evolution
+
+PyPaimon supports the Data Evolution mode. See [Data Evolution]({{< ref "append-table/data-evolution" >}}).
+
+## Update Columns By Row ID
+
+You can use `TableUpdate.update_by_arrow_with_row_id` to update columns of data evolution tables.
+
+The input data should include the `_ROW_ID` column. The update operation automatically sorts and matches each `_ROW_ID` to
+its corresponding `first_row_id`, then groups rows with the same `first_row_id` and writes each group to a separate file.
+
+```python
+simple_pa_schema = pa.schema([
+ ('f0', pa.int8()),
+ ('f1', pa.int16()),
+])
+schema = Schema.from_pyarrow_schema(simple_pa_schema,
+                                    options={'row-tracking.enabled': 'true', 'data-evolution.enabled': 'true'})
+catalog.create_table('default.test_row_tracking', schema, False)
+table = catalog.get_table('default.test_row_tracking')
+
+# write all columns
+write_builder = table.new_batch_write_builder()
+table_write = write_builder.new_write()
+table_commit = write_builder.new_commit()
+expect_data = pa.Table.from_pydict({
+ 'f0': [-1, 2],
+ 'f1': [-1001, 1002]
+}, schema=simple_pa_schema)
+table_write.write_arrow(expect_data)
+table_commit.commit(table_write.prepare_commit())
+table_write.close()
+table_commit.close()
+
+# update partial columns
+write_builder = table.new_batch_write_builder()
+table_update = write_builder.new_update().with_update_type(['f0'])
+table_commit = write_builder.new_commit()
+data2 = pa.Table.from_pydict({
+ '_ROW_ID': [0, 1],
+ 'f0': [5, 6],
+}, schema=pa.schema([
+ ('_ROW_ID', pa.int64()),
+ ('f0', pa.int8()),
+]))
+cmts = table_update.update_by_arrow_with_row_id(data2)
+table_commit.commit(cmts)
+table_commit.close()
+
+# content should be:
+# 'f0': [5, 6],
+# 'f1': [-1001, 1002]
+```
diff --git a/docs/content/pypaimon/overview.md b/docs/content/pypaimon/overview.md
new file mode 100644
index 0000000000..2006405eb2
--- /dev/null
+++ b/docs/content/pypaimon/overview.md
@@ -0,0 +1,54 @@
+---
+title: "Overview"
+weight: 1
+type: docs
+aliases:
+- /pypaimon/overview.html
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Overview
+
+PyPaimon is a Python implementation for connecting to Paimon catalogs and reading & writing Paimon tables. As a
+complete Python implementation, PyPaimon does not require a JDK installation.
+
+## Environment Settings
+
+The SDK is published as [pypaimon](https://pypi.org/project/pypaimon/) on PyPI. You can install it with:
+
+```shell
+pip install pypaimon
+```
+
+## Build From Source
+
+You can build the source package by executing the following command:
+
+```commandline
+python3 setup.py sdist
+```
+
+The built package is under `dist/`. You can then install it by executing the following command:
+
+```commandline
+pip3 install dist/*.tar.gz
+```
+
+The command will install the package and its core dependencies into your local Python environment.
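+
+As a quick sanity check (a minimal sketch that simply reuses the `pypaimon` and `CatalogFactory` names from the
+examples in this documentation), you can verify that the installed package is importable:
+
+```python
+# Verify the installation by importing the package and its catalog factory.
+import pypaimon
+from pypaimon import CatalogFactory
+
+print(pypaimon.__name__)  # prints "pypaimon" when the installation succeeded
+```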
diff --git a/docs/content/program-api/python-api.md b/docs/content/pypaimon/python-api.md
similarity index 74%
rename from docs/content/program-api/python-api.md
rename to docs/content/pypaimon/python-api.md
index 606937585c..04c7248126 100644
--- a/docs/content/program-api/python-api.md
+++ b/docs/content/pypaimon/python-api.md
@@ -1,9 +1,9 @@
---
title: "Python API"
-weight: 5
+weight: 2
type: docs
aliases:
- - /api/python-api.html
+ - /pypaimon/python-api.html
---
<!--
@@ -27,23 +27,13 @@ under the License.
# Python API
-PyPaimon is a Python implementation for connecting Paimon catalog, reading & writing tables. The complete Python
-implementation of the brand new PyPaimon does not require JDK installation.
-
-## Environment Settings
-
-SDK is published at [pypaimon](https://pypi.org/project/pypaimon/). You can install by
-
-```shell
-pip install pypaimon
-```
-
## Create Catalog
Before coming into contact with the Table, you need to create a Catalog.
{{< tabs "create-catalog" >}}
{{< tab "filesystem" >}}
+
```python
from pypaimon import CatalogFactory
@@ -53,9 +43,10 @@ catalog_options = {
}
catalog = CatalogFactory.create(catalog_options)
```
+
{{< /tab >}}
{{< tab "rest catalog" >}}
-The sample code is as follows. The detailed meaning of option can be found in [DLF Token](../concepts/rest/dlf.md).
+The sample code is as follows. The detailed meaning of each option can be found in [REST]({{< ref "concepts/rest/overview" >}}).
```python
from pypaimon import CatalogFactory
@@ -65,10 +56,7 @@ catalog_options = {
'metastore': 'rest',
'warehouse': 'xxx',
'uri': 'xxx',
- 'dlf.region': 'xxx',
- 'token.provider': 'xxx',
- 'dlf.access-key-id': 'xxx',
- 'dlf.access-key-secret': 'xxx'
+ 'token.provider': 'xxx'
}
catalog = CatalogFactory.create(catalog_options)
```
@@ -194,18 +182,6 @@ table_write.write_arrow(pa_table)
record_batch = ...
table_write.write_arrow_batch(record_batch)
-# 2.4 Write Ray Dataset (requires ray to be installed)
-import ray
-ray_dataset = ray.data.read_json("/path/to/data.jsonl")
-table_write.write_ray(ray_dataset, overwrite=False, concurrency=2)
-# Parameters:
-# - dataset: Ray Dataset to write
-# - overwrite: Whether to overwrite existing data (default: False)
-# - concurrency: Optional max number of concurrent Ray tasks
-# - ray_remote_args: Optional kwargs passed to ray.remote() (e.g., {"num_cpus": 2})
-# Note: write_ray() handles commit internally through Ray Datasink API.
-# Skip steps 3-4 if using write_ray() - just close the writer.
-
# 3. Commit data (required for write_pandas/write_arrow/write_arrow_batch only)
commit_messages = table_write.prepare_commit()
table_commit.commit(commit_messages)
@@ -226,56 +202,6 @@ write_builder = table.new_batch_write_builder().overwrite()
write_builder = table.new_batch_write_builder().overwrite({'dt': '2024-01-01'})
```
-### Update columns
-
-You can create `TableUpdate.update_by_arrow_with_row_id` to update columns to data evolution tables.
-
-The input data should include the `_ROW_ID` column, update operation will automatically sort and match each `_ROW_ID` to
-its corresponding `first_row_id`, then groups rows with the same `first_row_id` and writes them to a separate file.
-
-```python
-simple_pa_schema = pa.schema([
- ('f0', pa.int8()),
- ('f1', pa.int16()),
-])
-schema = Schema.from_pyarrow_schema(simple_pa_schema,
-                                    options={'row-tracking.enabled': 'true', 'data-evolution.enabled': 'true'})
-catalog.create_table('default.test_row_tracking', schema, False)
-table = catalog.get_table('default.test_row_tracking')
-
-# write all columns
-write_builder = table.new_batch_write_builder()
-table_write = write_builder.new_write()
-table_commit = write_builder.new_commit()
-expect_data = pa.Table.from_pydict({
- 'f0': [-1, 2],
- 'f1': [-1001, 1002]
-}, schema=simple_pa_schema)
-table_write.write_arrow(expect_data)
-table_commit.commit(table_write.prepare_commit())
-table_write.close()
-table_commit.close()
-
-# update partial columns
-write_builder = table.new_batch_write_builder()
-table_update = write_builder.new_update().with_update_type(['f0'])
-table_commit = write_builder.new_commit()
-data2 = pa.Table.from_pydict({
- '_ROW_ID': [0, 1],
- 'f0': [5, 6],
-}, schema=pa.schema([
- ('_ROW_ID', pa.int64()),
- ('f0', pa.int8()),
-]))
-cmts = table_update.update_by_arrow_with_row_id(data2)
-table_commit.commit(cmts)
-table_commit.close()
-
-# content should be:
-# 'f0': [5, 6],
-# 'f1': [-1001, 1002]
-```
-
## Batch Read
### Predicate pushdown
@@ -414,110 +340,6 @@ print(duckdb_con.query("SELECT * FROM duckdb_table WHERE f0 = 1").fetchdf())
# 0 1 a
```
-### Read Ray
-
-This requires `ray` to be installed.
-
-You can convert the splits into a Ray Dataset and handle it by Ray Data API for distributed processing:
-
-```python
-table_read = read_builder.new_read()
-ray_dataset = table_read.to_ray(splits)
-
-print(ray_dataset)
-# MaterializedDataset(num_blocks=1, num_rows=9, schema={f0: int32, f1: string})
-
-print(ray_dataset.take(3))
-# [{'f0': 1, 'f1': 'a'}, {'f0': 2, 'f1': 'b'}, {'f0': 3, 'f1': 'c'}]
-
-print(ray_dataset.to_pandas())
-# f0 f1
-# 0 1 a
-# 1 2 b
-# 2 3 c
-# 3 4 d
-# ...
-```
-
-The `to_ray()` method supports Ray Data API parameters for distributed processing:
-
-```python
-# Basic usage
-ray_dataset = table_read.to_ray(splits)
-
-# Specify number of output blocks
-ray_dataset = table_read.to_ray(splits, override_num_blocks=4)
-
-# Configure Ray remote arguments
-ray_dataset = table_read.to_ray(
- splits,
- override_num_blocks=4,
- ray_remote_args={"num_cpus": 2, "max_retries": 3}
-)
-
-# Use Ray Data operations
-mapped_dataset = ray_dataset.map(lambda row: {'value': row['value'] * 2})
-filtered_dataset = ray_dataset.filter(lambda row: row['score'] > 80)
-df = ray_dataset.to_pandas()
-```
-
-**Parameters:**
-- `override_num_blocks`: Optional override for the number of output blocks. By default,
-  Ray automatically determines the optimal number.
-- `ray_remote_args`: Optional kwargs passed to `ray.remote()` in read tasks
-  (e.g., `{"num_cpus": 2, "max_retries": 3}`).
-- `concurrency`: Optional max number of Ray tasks to run concurrently. By default,
-  dynamically decided based on available resources.
-- `**read_args`: Additional kwargs passed to the datasource (e.g., `per_task_row_limit`
-  in Ray 2.52.0+).
-
-**Ray Block Size Configuration:**
-
-If you need to configure Ray's block size (e.g., when Paimon splits exceed Ray's default
-128MB block size), set it before calling `to_ray()`:
-
-```python
-from ray.data import DataContext
-
-ctx = DataContext.get_current()
-ctx.target_max_block_size = 256 * 1024 * 1024 # 256MB (default is 128MB)
-ray_dataset = table_read.to_ray(splits)
-```
-
-See [Ray Data API Documentation](https://docs.ray.io/en/latest/data/api/doc/ray.data.read_datasource.html) for more details.
-
-### Read Pytorch Dataset
-
-This requires `torch` to be installed.
-
-You can read all the data into a `torch.utils.data.Dataset` or `torch.utils.data.IterableDataset`:
-
-```python
-from torch.utils.data import DataLoader
-
-table_read = read_builder.new_read()
-dataset = table_read.to_torch(splits, streaming=True)
-dataloader = DataLoader(
- dataset,
- batch_size=2,
- num_workers=2, # Concurrency to read data
- shuffle=False
-)
-
-# Collect all data from dataloader
-for batch_idx, batch_data in enumerate(dataloader):
- print(batch_data)
-
-# output:
-# {'user_id': tensor([1, 2]), 'behavior': ['a', 'b']}
-# {'user_id': tensor([3, 4]), 'behavior': ['c', 'd']}
-# {'user_id': tensor([5, 6]), 'behavior': ['e', 'f']}
-# {'user_id': tensor([7, 8]), 'behavior': ['g', 'h']}
-```
-
-When the `streaming` parameter is true, it will iteratively read;
-when it is false, it will read the full amount of data into memory.
-
### Incremental Read
This API allows reading data committed between two snapshot timestamps. The steps are as follows.
diff --git a/docs/content/pypaimon/pytorch.md b/docs/content/pypaimon/pytorch.md
new file mode 100644
index 0000000000..b34f49edcd
--- /dev/null
+++ b/docs/content/pypaimon/pytorch.md
@@ -0,0 +1,60 @@
+---
+title: "PyTorch"
+weight: 4
+type: docs
+aliases:
+ - /pypaimon/pytorch.html
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# PyTorch
+
+## Read
+
+This requires `torch` to be installed.
+
+You can read all the data into a `torch.utils.data.Dataset` or `torch.utils.data.IterableDataset`:
+
+```python
+from torch.utils.data import DataLoader
+
+table_read = read_builder.new_read()
+dataset = table_read.to_torch(splits, streaming=True)
+dataloader = DataLoader(
+ dataset,
+ batch_size=2,
+ num_workers=2, # Concurrency to read data
+ shuffle=False
+)
+
+# Collect all data from dataloader
+for batch_idx, batch_data in enumerate(dataloader):
+ print(batch_data)
+
+# output:
+# {'user_id': tensor([1, 2]), 'behavior': ['a', 'b']}
+# {'user_id': tensor([3, 4]), 'behavior': ['c', 'd']}
+# {'user_id': tensor([5, 6]), 'behavior': ['e', 'f']}
+# {'user_id': tensor([7, 8]), 'behavior': ['g', 'h']}
+```
+
+When the `streaming` parameter is true, the data is read iteratively;
+when it is false, the full amount of data is read into memory.
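+
+A minimal sketch of the non-streaming mode follows. It reuses `read_builder` and `splits` from the example above and
+only flips `streaming` to false; since the full data is held in memory, the `DataLoader` can also shuffle it
+(assuming the non-streaming dataset is the map-style `torch.utils.data.Dataset` mentioned above):
+
+```python
+from torch.utils.data import DataLoader
+
+table_read = read_builder.new_read()
+# streaming=False reads the full amount of data into memory
+dataset = table_read.to_torch(splits, streaming=False)
+dataloader = DataLoader(
+    dataset,
+    batch_size=2,
+    shuffle=True  # shuffling assumes the in-memory, map-style dataset
+)
+
+for batch_data in dataloader:
+    print(batch_data)
+```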
diff --git a/docs/content/pypaimon/ray-data.md b/docs/content/pypaimon/ray-data.md
new file mode 100644
index 0000000000..2cc728756a
--- /dev/null
+++ b/docs/content/pypaimon/ray-data.md
@@ -0,0 +1,142 @@
+---
+title: "Ray Data"
+weight: 3
+type: docs
+aliases:
+ - /pypaimon/ray-data.html
+---
+
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+# Ray Data
+
+## Read
+
+This requires `ray` to be installed.
+
+You can convert the splits into a Ray Dataset and handle it with the Ray Data API for distributed processing:
+
+```python
+table_read = read_builder.new_read()
+ray_dataset = table_read.to_ray(splits)
+
+print(ray_dataset)
+# MaterializedDataset(num_blocks=1, num_rows=9, schema={f0: int32, f1: string})
+
+print(ray_dataset.take(3))
+# [{'f0': 1, 'f1': 'a'}, {'f0': 2, 'f1': 'b'}, {'f0': 3, 'f1': 'c'}]
+
+print(ray_dataset.to_pandas())
+# f0 f1
+# 0 1 a
+# 1 2 b
+# 2 3 c
+# 3 4 d
+# ...
+```
+
+The `to_ray()` method supports Ray Data API parameters for distributed processing:
+
+```python
+# Basic usage
+ray_dataset = table_read.to_ray(splits)
+
+# Specify number of output blocks
+ray_dataset = table_read.to_ray(splits, override_num_blocks=4)
+
+# Configure Ray remote arguments
+ray_dataset = table_read.to_ray(
+ splits,
+ override_num_blocks=4,
+ ray_remote_args={"num_cpus": 2, "max_retries": 3}
+)
+
+# Use Ray Data operations
+mapped_dataset = ray_dataset.map(lambda row: {'value': row['value'] * 2})
+filtered_dataset = ray_dataset.filter(lambda row: row['score'] > 80)
+df = ray_dataset.to_pandas()
+```
+
+**Parameters:**
+- `override_num_blocks`: Optional override for the number of output blocks. By default,
+  Ray automatically determines the optimal number.
+- `ray_remote_args`: Optional kwargs passed to `ray.remote()` in read tasks
+  (e.g., `{"num_cpus": 2, "max_retries": 3}`).
+- `concurrency`: Optional max number of Ray tasks to run concurrently. By default,
+  dynamically decided based on available resources.
+- `**read_args`: Additional kwargs passed to the datasource (e.g., `per_task_row_limit`
+  in Ray 2.52.0+), as shown in the sketch below.
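+
+A minimal sketch combining `concurrency` with a datasource kwarg is shown below; it reuses the `splits` from above
+and is illustrative only, since `per_task_row_limit` is honored only on Ray 2.52.0+:
+
+```python
+ray_dataset = table_read.to_ray(
+    splits,
+    concurrency=4,               # at most 4 Ray read tasks run concurrently
+    per_task_row_limit=10_000    # forwarded to the datasource via **read_args (Ray 2.52.0+)
+)
+```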
+
+**Ray Block Size Configuration:**
+
+If you need to configure Ray's block size (e.g., when Paimon splits exceed Ray's default
+128MB block size), set it before calling `to_ray()`:
+
+```python
+from ray.data import DataContext
+
+ctx = DataContext.get_current()
+ctx.target_max_block_size = 256 * 1024 * 1024 # 256MB (default is 128MB)
+ray_dataset = table_read.to_ray(splits)
+```
+
+See [Ray Data API Documentation](https://docs.ray.io/en/latest/data/api/doc/ray.data.read_datasource.html) for more details.
+
+## Write
+
+```python
+table = catalog.get_table('database_name.table_name')
+
+# 1. Create table write and commit
+write_builder = table.new_batch_write_builder()
+table_write = write_builder.new_write()
+table_commit = write_builder.new_commit()
+
+# 2. Write Ray Dataset (requires ray to be installed)
+import ray
+ray_dataset = ray.data.read_json("/path/to/data.jsonl")
+table_write.write_ray(ray_dataset, overwrite=False, concurrency=2)
+# Parameters:
+# - dataset: Ray Dataset to write
+# - overwrite: Whether to overwrite existing data (default: False)
+# - concurrency: Optional max number of concurrent Ray tasks
+# - ray_remote_args: Optional kwargs passed to ray.remote() (e.g., {"num_cpus": 2})
+# Note: write_ray() handles commit internally through Ray Datasink API.
+# Skip steps 3-4 if using write_ray() - just close the writer.
+
+# 3. Commit data (required for write_pandas/write_arrow/write_arrow_batch only)
+commit_messages = table_write.prepare_commit()
+table_commit.commit(commit_messages)
+
+# 4. Close resources
+table_write.close()
+table_commit.close()
+```
+
+By default, the data will be appended to the table. If you want to overwrite the table, you should use the
+`TableWrite#overwrite` API:
+
+```python
+# overwrite whole table
+write_builder = table.new_batch_write_builder().overwrite()
+
+# overwrite partition 'dt=2024-01-01'
+write_builder = table.new_batch_write_builder().overwrite({'dt': '2024-01-01'})
+```