This is an automated email from the ASF dual-hosted git repository.
lzljs3620320 pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/paimon.git
The following commit(s) were added to refs/heads/master by this push:
new 1650c98bd5 [doc] Modify Python API to JVM free (#6242)
1650c98bd5 is described below
commit 1650c98bd52c7a62ab75ac65b1054ad78650b2f5
Author: umi <[email protected]>
AuthorDate: Thu Sep 11 22:26:53 2025 +0800
[doc] Modify Python API to JVM free (#6242)
---
docs/content/program-api/python-api.md | 229 +++++++--------------
.../pypaimon/tests/reader_append_only_test.py | 2 +-
2 files changed, 79 insertions(+), 152 deletions(-)
diff --git a/docs/content/program-api/python-api.md b/docs/content/program-api/python-api.md
index 691bcd5cdf..90779ed266 100644
--- a/docs/content/program-api/python-api.md
+++ b/docs/content/program-api/python-api.md
@@ -3,8 +3,9 @@ title: "Python API"
weight: 5
type: docs
aliases:
-- /api/python-api.html
+ - /api/python-api.html
---
+
<!--
Licensed to the Apache Software Foundation (ASF) under one
or more contributor license agreements. See the NOTICE file
@@ -26,99 +27,31 @@ under the License.
# Java-based Implementation For Python API
-[Python SDK ](https://github.com/apache/paimon-python) has defined Python API for Paimon. Currently, there is only a Java-based implementation.
-
-Java-based implementation will launch a JVM and use `py4j` to execute Java code to read and write Paimon table.
+[Python SDK](https://github.com/apache/paimon-python) defines the Python API for Paimon.
## Environment Settings
### SDK Installing
SDK is published at [pypaimon](https://pypi.org/project/pypaimon/). You can install by
-```shell
-pip install pypaimon
-```
-### Java Runtime Environment
-
-This SDK needs JRE 1.8. After installing JRE, make sure that at least one of the following conditions is met:
-1. `java` command is available. You can verify it by `java -version`.
-2. `JAVA_HOME` and `PATH` variables are set correctly. For example, you can set:
```shell
-export JAVA_HOME=/path/to/java-directory
-export PATH=$JAVA_HOME/bin:$PATH
-```
-
-### Set Environment Variables
-
-Because we need to launch a JVM to access Java code, JVM environment need to be set. Besides, the java code need Hadoop
-dependencies, so hadoop environment should be set.
-
-#### Java classpath
-
-The package has set necessary paimon core dependencies (Local/Hadoop FileIO, Avro/Orc/Parquet format support and
-FileSystem/Jdbc/Hive catalog), so If you just test codes in local or in hadoop environment, you don't need to set classpath.
-
-If you need other dependencies such as OSS/S3 filesystem jars, or special format and catalog ,please prepare jars and set
-classpath via one of the following ways:
-
-1. Set system environment variable: ```export _PYPAIMON_JAVA_CLASSPATH=/path/to/jars/*```
-2. Set environment variable in Python code:
-
-```python
-import os
-from pypaimon import constants
-
-os.environ[constants.PYPAIMON_JAVA_CLASSPATH] = '/path/to/jars/*'
-```
-
-#### JVM args (optional)
-
-You can set JVM args via one of the following ways:
-
-1. Set system environment variable: ```export _PYPAIMON_JVM_ARGS='arg1 arg2 ...'```
-2. Set environment variable in Python code:
-
-```python
-import os
-from pypaimon import constants
-
-os.environ[constants.PYPAIMON_JVM_ARGS] = 'arg1 arg2 ...'
-```
-
-#### Hadoop classpath
-
-If the machine is in a hadoop environment, please ensure the value of the environment variable HADOOP_CLASSPATH include
-path to the common Hadoop libraries, then you don't need to set hadoop.
-
-Otherwise, you should set hadoop classpath via one of the following ways:
-
-1. Set system environment variable: ```export _PYPAIMON_HADOOP_CLASSPATH=/path/to/jars/*```
-2. Set environment variable in Python code:
-
-```python
-import os
-from pypaimon import constants
-
-os.environ[constants.PYPAIMON_HADOOP_CLASSPATH] = '/path/to/jars/*'
+pip install pypaimon
```
-If you just want to test codes in local, we recommend to use [Flink Pre-bundled hadoop jar](https://flink.apache.org/downloads/#additional-components).
-
-
## Create Catalog
Before coming into contact with the Table, you need to create a Catalog.
```python
-from pypaimon import Catalog
+from pypaimon.catalog.catalog_factory import CatalogFactory
# Note that keys and values are all string
catalog_options = {
'metastore': 'filesystem',
'warehouse': 'file:///path/to/warehouse'
}
-catalog = Catalog.create(catalog_options)
+catalog = CatalogFactory.create(catalog_options)
```
## Create Database & Table
@@ -126,19 +59,20 @@ catalog = Catalog.create(catalog_options)
You can use the catalog to create table for writing data.
### Create Database (optional)
+
Table is located in a database. If you want to create table in a new database, you should create it.
```python
catalog.create_database(
- name='database_name',
- ignore_if_exists=True, # If you want to raise error if the database exists, set False
- properties={'key': 'value'} # optional database properties
+ name='database_name',
+ ignore_if_exists=True, # If you want to raise error if the database exists, set False
+ properties={'key': 'value'} # optional database properties
)
```
### Create Schema
-Table schema contains fields definition, partition keys, primary keys, table options and comment.
+Table schema contains field definitions, partition keys, primary keys, table options and a comment.
The field definition is described by `pyarrow.Schema`. All arguments except fields definition are optional.
Generally, there are two ways to build `pyarrow.Schema`.
@@ -157,16 +91,15 @@ pa_schema = pa.schema([
('value', pa.string())
])
-schema = Schema(
- pa_schema=pa_schema,
+schema = Schema.from_pyarrow_schema(
+ pa_schema=pa_schema,
partition_keys=['dt', 'hh'],
primary_keys=['dt', 'hh', 'pk'],
options={'bucket': '2'},
- comment='my test table'
-)
+ comment='my test table')
```
-See [Data Types]({{< ref "python-api#data-types" >}}) for all supported `pyarrow-to-paimon` data types mapping.
+See [Data Types]({{< ref "python-api#data-types" >}}) for all supported `pyarrow-to-paimon` data types mapping.
Second, if you have some Pandas data, the `pa_schema` can be extracted from `DataFrame`:
@@ -187,9 +120,9 @@ dataframe = pd.DataFrame(data)
# Get Paimon Schema
record_batch = pa.RecordBatch.from_pandas(dataframe)
-schema = Schema(
- pa_schema=record_batch.schema,
- partition_keys=['dt', 'hh'],
+schema = Schema.from_pyarrow_schema(
+ pa_schema=record_batch.schema,
+ partition_keys=['dt', 'hh'],
primary_keys=['dt', 'hh', 'pk'],
options={'bucket': '2'},
comment='my test table'
@@ -204,8 +137,8 @@ After building table schema, you can create corresponding table:
schema = ...
catalog.create_table(
identifier='database_name.table_name',
- schema=schema,
- ignore_if_exists=True # If you want to raise error if the table exists, set False
+ schema=schema,
+ ignore_if_exists=True # If you want to raise error if the table exists, set False
)
```
@@ -217,13 +150,56 @@ The Table interface provides tools to read and write table.
table = catalog.get_table('database_name.table_name')
```
-## Batch Read
+## Batch Write
+
+Paimon table write uses a two-phase commit: you can write many times, but once committed, no more data can be written.
+
+{{< hint warning >}}
+Currently, writing multiple times and committing once is only supported for append-only tables.
+{{< /hint >}}
+
+```python
+table = catalog.get_table('database_name.table_name')
+
+# 1. Create table write and commit
+write_builder = table.new_batch_write_builder()
+table_write = write_builder.new_write()
+table_commit = write_builder.new_commit()
+
+# 2. Write data. Support 3 methods:
+# 2.1 Write pandas.DataFrame
+dataframe = ...
+table_write.write_pandas(dataframe)
+
+# 2.2 Write pyarrow.Table
+pa_table = ...
+table_write.write_arrow(pa_table)
+
+# 2.3 Write pyarrow.RecordBatch
+record_batch = ...
+table_write.write_arrow_batch(record_batch)
+
+# 3. Commit data
+commit_messages = table_write.prepare_commit()
+table_commit.commit(commit_messages)
-### Set Read Parallelism
+# 4. Close resources
+table_write.close()
+table_commit.close()
+```
-TableRead interface provides parallelly reading for multiple splits. You can set `'max-workers': 'N'` in `catalog_options`
-to set thread numbers for reading splits. `max-workers` is 1 by default, that means TableRead will read splits sequentially
-if you doesn't set `max-workers`.
+By default, data will be appended to the table. If you want to overwrite the table, you should use the `TableWrite#overwrite`
+API:
+
+```python
+# overwrite whole table
+write_builder.overwrite()
+
+# overwrite partition 'dt=2024-01-01'
+write_builder.overwrite({'dt': '2024-01-01'})
+```
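+
+As a rough end-to-end sketch (reusing the builder from the example above; `dataframe` is a placeholder, and calling `overwrite` before `new_write`/`new_commit` follows the usual builder convention and is an assumption here):
+
+```python
+# Overwrite partition 'dt=2024-01-01' in a single batch job
+write_builder = table.new_batch_write_builder()
+write_builder.overwrite({'dt': '2024-01-01'})
+
+table_write = write_builder.new_write()
+table_commit = write_builder.new_commit()
+
+table_write.write_pandas(dataframe)  # placeholder pandas.DataFrame
+table_commit.commit(table_write.prepare_commit())
+
+table_write.close()
+table_commit.close()
+```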
+
+## Batch Read
### Get ReadBuilder and Perform pushdown
@@ -296,7 +272,7 @@ You can also read data into a `pyarrow.RecordBatchReader` and iterate record bat
```python
table_read = read_builder.new_read()
-for batch in table_read.to_arrow_batch_reader(splits):
+for batch in table_read.to_iterator(splits):
print(batch)
# pyarrow.RecordBatch
@@ -374,66 +350,17 @@ print(ray_dataset.to_pandas())
# ...
```
-## Batch Write
-
-Paimon table write is Two-Phase Commit, you can write many times, but once committed, no more data can be written.
-
-{{< hint warning >}}
-Currently, Python SDK doesn't support writing primary key table with `bucket=-1`.
-{{< /hint >}}
-
-```python
-table = catalog.get_table('database_name.table_name')
-
-# 1. Create table write and commit
-write_builder = table.new_batch_write_builder()
-table_write = write_builder.new_write()
-table_commit = write_builder.new_commit()
-
-# 2. Write data. Support 3 methods:
-# 2.1 Write pandas.DataFrame
-dataframe = ...
-table_write.write_pandas(dataframe)
-
-# 2.2 Write pyarrow.Table
-pa_table = ...
-table_write.write_arrow(pa_table)
-
-# 2.3 Write pyarrow.RecordBatch
-record_batch = ...
-table_write.write_arrow_batch(record_batch)
-
-# 3. Commit data
-commit_messages = table_write.prepare_commit()
-table_commit.commit(commit_messages)
-
-# 4. Close resources
-table_write.close()
-table_commit.close()
-```
-
-By default, the data will be appended to table. If you want to overwrite table, you should use `TableWrite#overwrite` API:
-
-```python
-# overwrite whole table
-write_builder.overwrite()
-
-# overwrite partition 'dt=2024-01-01'
-write_builder.overwrite({'dt': '2024-01-01'})
-```
-
## Data Types
-| pyarrow                                   | Paimon   |
-|:------------------------------------------|:---------|
-| pyarrow.int8()                            | TINYINT  |
-| pyarrow.int16()                           | SMALLINT |
-| pyarrow.int32()                           | INT      |
-| pyarrow.int64()                           | BIGINT   |
-| pyarrow.float16() <br/>pyarrow.float32()  | FLOAT    |
-| pyarrow.float64()                         | DOUBLE   |
-| pyarrow.string()                          | STRING   |
-| pyarrow.boolean()                         | BOOLEAN  |
+| pyarrow                                                            | Paimon   |
+|:-------------------------------------------------------------------|:---------|
+| pyarrow.int8()                                                     | TINYINT  |
+| pyarrow.int16()                                                    | SMALLINT |
+| pyarrow.int32()                                                    | INT      |
+| pyarrow.int64()                                                    | BIGINT   |
+| pyarrow.float16() <br/>pyarrow.float32()                           | FLOAT    |
+| pyarrow.float64()                                                  | DOUBLE   |
+| pyarrow.string()                                                   | STRING   |
+| pyarrow.boolean()                                                  | BOOLEAN  |
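+
+For illustration, a `pyarrow.Schema` using several of these mappings might look like the sketch below (`Schema.from_pyarrow_schema` is used as in the Create Schema section above; field names are placeholders):
+
+```python
+import pyarrow as pa
+
+pa_schema = pa.schema([
+    ('id', pa.int64()),       # -> BIGINT
+    ('age', pa.int32()),      # -> INT
+    ('name', pa.string()),    # -> STRING
+    ('active', pa.boolean())  # -> BOOLEAN
+])
+schema = Schema.from_pyarrow_schema(pa_schema=pa_schema)
+```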
## Predicate
diff --git a/paimon-python/pypaimon/tests/reader_append_only_test.py b/paimon-python/pypaimon/tests/reader_append_only_test.py
index ca22c79fbe..72b421aa94 100644
--- a/paimon-python/pypaimon/tests/reader_append_only_test.py
+++ b/paimon-python/pypaimon/tests/reader_append_only_test.py
@@ -125,7 +125,7 @@ class AoReaderTest(unittest.TestCase):
p3 = predicate_builder.between('user_id', 0, 6) # [2/b, 3/c, 4/d, 5/e, 6/f] left
p4 = predicate_builder.is_not_in('behavior', ['b', 'e']) # [3/c, 4/d, 6/f] left
p5 = predicate_builder.is_in('dt', ['p1']) # exclude 3/c
- p6 = predicate_builder.is_not_null('behavior') # exclude 4/d
+ p6 = predicate_builder.is_not_null('behavior') # exclude 4/d
g1 = predicate_builder.and_predicates([p1, p2, p3, p4, p5, p6])
read_builder = table.new_read_builder().with_filter(g1)
actual = self._read_test_table(read_builder)