smaheshwar-pltr commented on code in PR #1611:
URL: https://github.com/apache/iceberg-python/pull/1611#discussion_r1942091778
##########
tests/table/test_locations.py:
##########
@@ -133,3 +133,13 @@ def test_hash_injection(data_file_name: str,
expected_hash: str) -> None:
provider = load_location_provider(table_location="table_location",
table_properties=EMPTY_DICT)
assert provider.new_data_location(data_file_name) ==
f"table_location/data/{expected_hash}/{data_file_name}"
+
+
+def test_write_data_path() -> None:
Review Comment:
A test for `SimpleLocationProvider` as well might be nice. I'd separate into
`test_simple_location_provider_...` and `test_test_object_storage_...` maybe
like we have in this file.
##########
mkdocs/docs/configuration.md:
##########
@@ -54,18 +54,19 @@ Iceberg tables support table properties to configure table
behavior.
### Write options
-| Key | Options
| Default | Description
|
-|------------------------------------------|-----------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}`
| zstd | Sets the Parquet compression coddec.
|
-| `write.parquet.compression-level` | Integer
| null | Parquet compression level for the codec. If not set, it is up to
PyIceberg
|
-| `write.parquet.row-group-limit` | Number of rows
| 1048576 | The upper bound of the number of entries within a single row group
|
-| `write.parquet.page-size-bytes` | Size in bytes
| 1MB | Set a target threshold for the approximate encoded size of data
pages within a column chunk
|
-| `write.parquet.page-row-limit` | Number of rows
| 20000 | Set a target threshold for the maximum number of rows within a
column chunk
|
-| `write.parquet.dict-size-bytes` | Size in bytes
| 2MB | Set the dictionary page size limit per row group
|
-| `write.metadata.previous-versions-max` | Integer
| 100 | The max number of previous version metadata files to keep before
deleting after commit.
|
-| `write.object-storage.enabled` | Boolean
| True | Enables the
[`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider)
that adds a hash component to file paths. Note: the default value of `True`
differs from Iceberg's Java implementation |
-| `write.object-storage.partitioned-paths` | Boolean
| True | Controls whether [partition values are included in file
paths](configuration.md#partition-exclusion) when object storage is enabled
|
-| `write.py-location-provider.impl` | String of form `module.ClassName`
| null | Optional, [custom
`LocationProvider`](configuration.md#loading-a-custom-location-provider)
implementation
|
+| Key | Options
| Default | Description
|
+|------------------------------------------|------------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}`
| zstd | Sets the Parquet compression coddec.
|
+| `write.parquet.compression-level` | Integer
| null | Parquet compression level for the codec. If not set, it is up to
PyIceberg
|
+| `write.parquet.row-group-limit` | Number of rows
| 1048576 | The upper bound of the number of entries within a single row group
|
+| `write.parquet.page-size-bytes` | Size in bytes
| 1MB | Set a target threshold for the approximate encoded size of data
pages within a column chunk
|
+| `write.parquet.page-row-limit` | Number of rows
| 20000 | Set a target threshold for the maximum number of rows within a
column chunk
|
+| `write.parquet.dict-size-bytes` | Size in bytes
| 2MB | Set the dictionary page size limit per row group
|
+| `write.metadata.previous-versions-max` | Integer
| 100 | The max number of previous version metadata files to keep before
deleting after commit.
|
+| `write.object-storage.enabled` | Boolean
| True | Enables the
[`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider)
that adds a hash component to file paths. Note: the default value of `True`
differs from Iceberg's Java implementation |
+| `write.object-storage.partitioned-paths` | Boolean
| True | Controls whether [partition values are included in file
paths](configuration.md#partition-exclusion) when object storage is enabled
|
+| `write.py-location-provider.impl` | String of form `module.ClassName`
| null | Optional, [custom
`LocationProvider`](configuration.md#loading-a-custom-location-provider)
implementation
|
+| `write.data.path` | String pointing to location
| ∅ | Sets the location where to write the data. If not set, it will
use the table location postfixed with `data/`.
|
Review Comment:
Also bringing attention to
https://github.com/apache/iceberg-python/pull/1537#discussion_r1921100230 - it
would be great to make the location providers docs consistent with
`write.data.path` now.
https://github.com/apache/iceberg-python/blob/4a055d786ae997921f8ac0c8393f6ae7d4fe41bb/mkdocs/docs/configuration.md?plain=1#L213
https://github.com/apache/iceberg-python/blob/4a055d786ae997921f8ac0c8393f6ae7d4fe41bb/mkdocs/docs/configuration.md?plain=1#L242
I think these are the only places that require changing - they'll be
`write.data.path` (and maybe for clarity we should reiterate there that this
defaults to `{location}/data` so examples are clearer). Could we do that in
this PR maybe? (Happy to follow-up with this otherwise)
##########
pyiceberg/table/locations.py:
##########
@@ -28,6 +28,8 @@
logger = logging.getLogger(__name__)
+WRITE_DATA_PATH = "write.data.path"
Review Comment:
There's an argument for this, but I'd prefer this be in `TableProperties`
with the other write table properties (it's what the Java implementation does
too)
##########
mkdocs/docs/configuration.md:
##########
@@ -54,18 +54,19 @@ Iceberg tables support table properties to configure table
behavior.
### Write options
-| Key | Options
| Default | Description
|
-|------------------------------------------|-----------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}`
| zstd | Sets the Parquet compression coddec.
|
-| `write.parquet.compression-level` | Integer
| null | Parquet compression level for the codec. If not set, it is up to
PyIceberg
|
-| `write.parquet.row-group-limit` | Number of rows
| 1048576 | The upper bound of the number of entries within a single row group
|
-| `write.parquet.page-size-bytes` | Size in bytes
| 1MB | Set a target threshold for the approximate encoded size of data
pages within a column chunk
|
-| `write.parquet.page-row-limit` | Number of rows
| 20000 | Set a target threshold for the maximum number of rows within a
column chunk
|
-| `write.parquet.dict-size-bytes` | Size in bytes
| 2MB | Set the dictionary page size limit per row group
|
-| `write.metadata.previous-versions-max` | Integer
| 100 | The max number of previous version metadata files to keep before
deleting after commit.
|
-| `write.object-storage.enabled` | Boolean
| True | Enables the
[`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider)
that adds a hash component to file paths. Note: the default value of `True`
differs from Iceberg's Java implementation |
-| `write.object-storage.partitioned-paths` | Boolean
| True | Controls whether [partition values are included in file
paths](configuration.md#partition-exclusion) when object storage is enabled
|
-| `write.py-location-provider.impl` | String of form `module.ClassName`
| null | Optional, [custom
`LocationProvider`](configuration.md#loading-a-custom-location-provider)
implementation
|
+| Key | Options
| Default | Description
|
+|------------------------------------------|------------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}`
| zstd | Sets the Parquet compression coddec.
|
+| `write.parquet.compression-level` | Integer
| null | Parquet compression level for the codec. If not set, it is up to
PyIceberg
|
+| `write.parquet.row-group-limit` | Number of rows
| 1048576 | The upper bound of the number of entries within a single row group
|
+| `write.parquet.page-size-bytes` | Size in bytes
| 1MB | Set a target threshold for the approximate encoded size of data
pages within a column chunk
|
+| `write.parquet.page-row-limit` | Number of rows
| 20000 | Set a target threshold for the maximum number of rows within a
column chunk
|
+| `write.parquet.dict-size-bytes` | Size in bytes
| 2MB | Set the dictionary page size limit per row group
|
+| `write.metadata.previous-versions-max` | Integer
| 100 | The max number of previous version metadata files to keep before
deleting after commit.
|
+| `write.object-storage.enabled` | Boolean
| True | Enables the
[`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider)
that adds a hash component to file paths. Note: the default value of `True`
differs from Iceberg's Java implementation |
+| `write.object-storage.partitioned-paths` | Boolean
| True | Controls whether [partition values are included in file
paths](configuration.md#partition-exclusion) when object storage is enabled
|
+| `write.py-location-provider.impl` | String of form `module.ClassName`
| null | Optional, [custom
`LocationProvider`](configuration.md#loading-a-custom-location-provider)
implementation
|
+| `write.data.path` | String pointing to location
| ∅ | Sets the location where to write the data. If not set, it will
use the table location postfixed with `data/`.
|
Review Comment:
Also, what about 'table location + /data' as the default instead of ∅ like
in the Java docs?
##########
mkdocs/docs/configuration.md:
##########
@@ -54,18 +54,19 @@ Iceberg tables support table properties to configure table
behavior.
### Write options
-| Key | Options
| Default | Description
|
-|------------------------------------------|-----------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}`
| zstd | Sets the Parquet compression coddec.
|
-| `write.parquet.compression-level` | Integer
| null | Parquet compression level for the codec. If not set, it is up to
PyIceberg
|
-| `write.parquet.row-group-limit` | Number of rows
| 1048576 | The upper bound of the number of entries within a single row group
|
-| `write.parquet.page-size-bytes` | Size in bytes
| 1MB | Set a target threshold for the approximate encoded size of data
pages within a column chunk
|
-| `write.parquet.page-row-limit` | Number of rows
| 20000 | Set a target threshold for the maximum number of rows within a
column chunk
|
-| `write.parquet.dict-size-bytes` | Size in bytes
| 2MB | Set the dictionary page size limit per row group
|
-| `write.metadata.previous-versions-max` | Integer
| 100 | The max number of previous version metadata files to keep before
deleting after commit.
|
-| `write.object-storage.enabled` | Boolean
| True | Enables the
[`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider)
that adds a hash component to file paths. Note: the default value of `True`
differs from Iceberg's Java implementation |
-| `write.object-storage.partitioned-paths` | Boolean
| True | Controls whether [partition values are included in file
paths](configuration.md#partition-exclusion) when object storage is enabled
|
-| `write.py-location-provider.impl` | String of form `module.ClassName`
| null | Optional, [custom
`LocationProvider`](configuration.md#loading-a-custom-location-provider)
implementation
|
+| Key | Options
| Default | Description
|
+|------------------------------------------|------------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `write.parquet.compression-codec` | `{uncompressed,zstd,gzip,snappy}`
| zstd | Sets the Parquet compression coddec.
|
+| `write.parquet.compression-level` | Integer
| null | Parquet compression level for the codec. If not set, it is up to
PyIceberg
|
+| `write.parquet.row-group-limit` | Number of rows
| 1048576 | The upper bound of the number of entries within a single row group
|
+| `write.parquet.page-size-bytes` | Size in bytes
| 1MB | Set a target threshold for the approximate encoded size of data
pages within a column chunk
|
+| `write.parquet.page-row-limit` | Number of rows
| 20000 | Set a target threshold for the maximum number of rows within a
column chunk
|
+| `write.parquet.dict-size-bytes` | Size in bytes
| 2MB | Set the dictionary page size limit per row group
|
+| `write.metadata.previous-versions-max` | Integer
| 100 | The max number of previous version metadata files to keep before
deleting after commit.
|
+| `write.object-storage.enabled` | Boolean
| True | Enables the
[`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider)
that adds a hash component to file paths. Note: the default value of `True`
differs from Iceberg's Java implementation |
+| `write.object-storage.partitioned-paths` | Boolean
| True | Controls whether [partition values are included in file
paths](configuration.md#partition-exclusion) when object storage is enabled
|
+| `write.py-location-provider.impl` | String of form `module.ClassName`
| null | Optional, [custom
`LocationProvider`](configuration.md#loading-a-custom-location-provider)
implementation
|
+| `write.data.path` | String pointing to location
| ∅ | Sets the location where to write the data. If not set, it will
use the table location postfixed with `data/`.
|
Review Comment:
Maybe something like:
```suggestion
| `write.data.path` | String referencing a location
| ∅ | Sets the location under which data is written. If not set,
the table location (`table.metadata.location`) postfixed with `data/` is used.
|
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]