This is an automated email from the ASF dual-hosted git repository.
kevinjqliu pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/iceberg-python.git
The following commit(s) were added to refs/heads/main by this push:
new 4a8c84e8 docs: add type mapping tables between PyIceberg and PyArrow (#3098)
4a8c84e8 is described below
commit 4a8c84e81332ca1b1b426dd77d00375c000dcef2
Author: committobetter <[email protected]>
AuthorDate: Wed Mar 18 00:05:48 2026 +0700
docs: add type mapping tables between PyIceberg and PyArrow (#3098)
Closes #2226
# Rationale for this change
This PR adds documentation with tables describing the type mapping
between PyArrow and PyIceberg data types.
## Are these changes tested?
Yes.
The changes are tested locally as shown in the image below.
<img width="1563" height="792" alt="image"
src="https://github.com/user-attachments/assets/1d9fc6a6-a1ea-4feb-a4d7-71d9dd036813"
/>
## Are there any user-facing changes?
Yes.
This PR adds new user-facing documentation.
---------
Co-authored-by: Kevin Liu <[email protected]>
---
mkdocs/docs/api.md | 84 ++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 84 insertions(+)
diff --git a/mkdocs/docs/api.md b/mkdocs/docs/api.md
index 506547fc..65f91c96 100644
--- a/mkdocs/docs/api.md
+++ b/mkdocs/docs/api.md
@@ -2039,3 +2039,87 @@ DataFrame()
| 3 | 6 |
+---+---+
```
+
+## Type mapping
+
+### PyArrow
+
+The Iceberg specification only specifies type mapping for Avro, Parquet, and ORC:
+
+- [Iceberg to Avro](https://iceberg.apache.org/spec/#avro)
+
+- [Iceberg to Parquet](https://iceberg.apache.org/spec/#parquet)
+
+- [Iceberg to ORC](https://iceberg.apache.org/spec/#orc)
+
+The following tables describe the type mappings between PyIceberg and PyArrow. In the tables below, `pa` refers to the `pyarrow` module:
+
+```python
+import pyarrow as pa
+```
+
+#### PyIceberg to PyArrow type mapping
+
+| PyIceberg type class | PyArrow type |
+|---------------------------------|-------------------------------------|
+| `BooleanType` | `pa.bool_()` |
+| `IntegerType` | `pa.int32()` |
+| `LongType` | `pa.int64()` |
+| `FloatType` | `pa.float32()` |
+| `DoubleType` | `pa.float64()` |
+| `DecimalType(p, s)` | `pa.decimal128(p, s)` |
+| `DateType` | `pa.date32()` |
+| `TimeType` | `pa.time64("us")` |
+| `TimestampType` | `pa.timestamp("us")` |
+| `TimestampNanoType` (format version 3 only) | `pa.timestamp("ns")` [[2]](#notes) |
+| `TimestamptzType` | `pa.timestamp("us", tz="UTC")` [[1]](#notes) |
+| `TimestamptzNanoType` (format version 3 only) | `pa.timestamp("ns", tz="UTC")` [[1]](#notes) [[2]](#notes) |
+| `StringType` | `pa.large_string()` |
+| `UUIDType` | `pa.uuid()` |
+| `BinaryType` | `pa.large_binary()` |
+| `FixedType(L)` | `pa.binary(L)` |
+| `StructType` | `pa.struct()` |
+| `ListType(e)` | `pa.large_list(e)` |
+| `MapType(k, v)` | `pa.map_(k, v)` |
+| `UnknownType` (format version 3 only) | `pa.null()` [[2]](#notes) |
+
+---
+
+#### PyArrow to PyIceberg type mapping
+
+| PyArrow type | PyIceberg type class |
+|------------------------------------|-----------------------------|
+| `pa.bool_()` | `BooleanType` |
+| `pa.int8()` / `pa.int16()` / `pa.int32()` | `IntegerType` |
+| `pa.int64()` | `LongType` |
+| `pa.float32()` | `FloatType` |
+| `pa.float64()` | `DoubleType` |
+| `pa.decimal128(p, s)` | `DecimalType(p, s)` |
+| `pa.decimal256(p, s)` | Unsupported |
+| `pa.date32()` | `DateType` |
+| `pa.date64()` | Unsupported |
+| `pa.time64("us")` | `TimeType` |
+| `pa.timestamp("s")` / `pa.timestamp("ms")` / `pa.timestamp("us")` | `TimestampType` |
+| `pa.timestamp("ns")` | `TimestampNanoType` (format version 3 only) [[2]](#notes) |
+| `pa.timestamp("s", tz="UTC")` / `pa.timestamp("ms", tz="UTC")` / `pa.timestamp("us", tz="UTC")` | `TimestamptzType` [[1]](#notes) |
+| `pa.timestamp("ns", tz="UTC")` | `TimestamptzNanoType` (format version 3 only) [[1]](#notes) [[2]](#notes) |
+| `pa.string()` / `pa.large_string()` / `pa.string_view()` | `StringType` |
+| `pa.uuid()` | `UUIDType` |
+| `pa.binary()` / `pa.large_binary()` / `pa.binary_view()` | `BinaryType` |
+| `pa.binary(L)` | `FixedType(L)` |
+| `pa.struct([...])` | `StructType` |
+| `pa.list_(e)` / `pa.large_list(e)` / `pa.list_(e, fixed_size)` | `ListType(e)` |
+| `pa.map_(k, v)` | `MapType(k, v)` |
+| `pa.null()` | `UnknownType` (format version 3 only) [[2]](#notes) |
+
+---
+
+#### Notes
+
+[1] Only the `UTC` timezone and its aliases are supported for PyArrow-to-PyIceberg timestamp-with-timezone conversion.
+
+[2] The PyArrow-to-PyIceberg mappings for `pa.timestamp("ns")`, `pa.timestamp("ns", tz="UTC")`, and `pa.null()` require Iceberg format version 3. By default, `pyarrow_to_schema()` uses format version 2. `TimestampNanoType`, `TimestamptzNanoType`, and `UnknownType` are likewise format-version-3-only Iceberg types.
+
+[3] For nanosecond Iceberg timestamp types (`TimestampNanoType` and `TimestamptzNanoType`), writing in format version 3 is not yet implemented (see [GitHub issue #1551](https://github.com/apache/iceberg-python/issues/1551)).
+
+[4] The mappings are not fully symmetric. On read, PyArrow normalizes some families of types into a single Iceberg type, and on write PyIceberg emits a canonical PyArrow type: for example, `pa.int8()` and `pa.int16()` read as `IntegerType` and write back as `pa.int32()`, `pa.string()` reads as `StringType` and writes back as `pa.large_string()`, `pa.binary()` reads as `BinaryType` and writes back as `pa.large_binary()`, `pa.list_(...)` writes back as `pa.large_list(...)`, and `pa.timesta [...]
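
Editorial illustration (not part of the commit): the asymmetry described in note [4] can be checked mechanically by treating the two tables as lookup dictionaries. The sketch below is a minimal model under stated assumptions: type names are plain strings copied from a subset of the table rows, and the helpers `round_trip` and `changed_by_round_trip` are hypothetical names introduced here. It does not call PyIceberg or PyArrow and says nothing about how either library implements the conversion.

```python
# Illustrative sketch only: the keys and values are strings copied from
# the two mapping tables; no pyiceberg or pyarrow code is executed.

# PyArrow -> PyIceberg (subset of the "PyArrow to PyIceberg" table).
ARROW_TO_ICEBERG = {
    "pa.int8()": "IntegerType",
    "pa.int16()": "IntegerType",
    "pa.int32()": "IntegerType",
    "pa.int64()": "LongType",
    "pa.string()": "StringType",
    "pa.large_string()": "StringType",
    "pa.string_view()": "StringType",
    "pa.binary()": "BinaryType",
    "pa.large_binary()": "BinaryType",
    "pa.binary_view()": "BinaryType",
}

# PyIceberg -> PyArrow (subset of the "PyIceberg to PyArrow" table).
ICEBERG_TO_ARROW = {
    "IntegerType": "pa.int32()",
    "LongType": "pa.int64()",
    "StringType": "pa.large_string()",
    "BinaryType": "pa.large_binary()",
}


def round_trip(arrow_type: str) -> str:
    """Hypothetical helper: map a PyArrow type name to its Iceberg type
    and back, using only the table rows modeled above."""
    return ICEBERG_TO_ARROW[ARROW_TO_ICEBERG[arrow_type]]


def changed_by_round_trip() -> list[str]:
    """Return the modeled PyArrow type names that do not survive a
    round trip unchanged, sorted alphabetically."""
    return sorted(t for t in ARROW_TO_ICEBERG if round_trip(t) != t)


if __name__ == "__main__":
    for name in changed_by_round_trip():
        print(f"{name} -> {ARROW_TO_ICEBERG[name]} -> {round_trip(name)}")
```

In this model, `round_trip("pa.int8()")` yields `"pa.int32()"`, while `round_trip("pa.int64()")` returns its input unchanged, which matches the examples given in note [4].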