Re: [PR] Add support for `write.data.path` [iceberg-python]

via GitHub Tue, 04 Feb 2025 16:43:34 -0800


smaheshwar-pltr commented on code in PR #1611:
URL: https://github.com/apache/iceberg-python/pull/1611#discussion_r1942091778



##########
tests/table/test_locations.py:
##########
@@ -133,3 +133,13 @@ def test_hash_injection(data_file_name: str, 
expected_hash: str) -> None:
     provider = load_location_provider(table_location="table_location", 
table_properties=EMPTY_DICT)
 
     assert provider.new_data_location(data_file_name) == 
f"table_location/data/{expected_hash}/{data_file_name}"
+
+
+def test_write_data_path() -> None:

Review Comment:
   A test for `SimpleLocationProvider` as well might be nice. I'd separate into 
`test_simple_location_provider_...` and `test_test_object_storage_...` maybe 
like we have in this file.



##########
mkdocs/docs/configuration.md:
##########
@@ -54,18 +54,19 @@ Iceberg tables support table properties to configure table 
behavior.
 
 ### Write options
 
-| Key                                      | Options                           
| Default | Description                                                         
                                                                                
                                                                |
-|------------------------------------------|-----------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `write.parquet.compression-codec`        | `{uncompressed,zstd,gzip,snappy}` 
| zstd    | Sets the Parquet compression coddec.                                
                                                                                
                                                                |
-| `write.parquet.compression-level`        | Integer                           
| null    | Parquet compression level for the codec. If not set, it is up to 
PyIceberg                                                                       
                                                                   |
-| `write.parquet.row-group-limit`          | Number of rows                    
| 1048576 | The upper bound of the number of entries within a single row group  
                                                                                
                                                                |
-| `write.parquet.page-size-bytes`          | Size in bytes                     
| 1MB     | Set a target threshold for the approximate encoded size of data 
pages within a column chunk                                                     
                                                                    |
-| `write.parquet.page-row-limit`           | Number of rows                    
| 20000   | Set a target threshold for the maximum number of rows within a 
column chunk                                                                    
                                                                     |
-| `write.parquet.dict-size-bytes`          | Size in bytes                     
| 2MB     | Set the dictionary page size limit per row group                    
                                                                                
                                                                |
-| `write.metadata.previous-versions-max`   | Integer                           
| 100     | The max number of previous version metadata files to keep before 
deleting after commit.                                                          
                                                                   |
-| `write.object-storage.enabled`           | Boolean                           
| True    | Enables the 
[`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider)
 that adds a hash component to file paths. Note: the default value of `True` 
differs from Iceberg's Java implementation |
-| `write.object-storage.partitioned-paths` | Boolean                           
| True    | Controls whether [partition values are included in file 
paths](configuration.md#partition-exclusion) when object storage is enabled     
                                                                            |
-| `write.py-location-provider.impl`        | String of form `module.ClassName` 
| null    | Optional, [custom 
`LocationProvider`](configuration.md#loading-a-custom-location-provider) 
implementation                                                                  
                                         |
+| Key                                      | Options                           
 | Default | Description                                                        
                                                                                
                                                                 |
+|------------------------------------------|------------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `write.parquet.compression-codec`        | `{uncompressed,zstd,gzip,snappy}` 
 | zstd    | Sets the Parquet compression coddec.                               
                                                                                
                                                                 |
+| `write.parquet.compression-level`        | Integer                           
 | null    | Parquet compression level for the codec. If not set, it is up to 
PyIceberg                                                                       
                                                                   |
+| `write.parquet.row-group-limit`          | Number of rows                    
 | 1048576 | The upper bound of the number of entries within a single row group 
                                                                                
                                                                 |
+| `write.parquet.page-size-bytes`          | Size in bytes                     
 | 1MB     | Set a target threshold for the approximate encoded size of data 
pages within a column chunk                                                     
                                                                    |
+| `write.parquet.page-row-limit`           | Number of rows                    
 | 20000   | Set a target threshold for the maximum number of rows within a 
column chunk                                                                    
                                                                     |
+| `write.parquet.dict-size-bytes`          | Size in bytes                     
 | 2MB     | Set the dictionary page size limit per row group                   
                                                                                
                                                                 |
+| `write.metadata.previous-versions-max`   | Integer                           
 | 100     | The max number of previous version metadata files to keep before 
deleting after commit.                                                          
                                                                   |
+| `write.object-storage.enabled`           | Boolean                           
 | True    | Enables the 
[`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider)
 that adds a hash component to file paths. Note: the default value of `True` 
differs from Iceberg's Java implementation |
+| `write.object-storage.partitioned-paths` | Boolean                           
 | True    | Controls whether [partition values are included in file 
paths](configuration.md#partition-exclusion) when object storage is enabled     
                                                                            |
+| `write.py-location-provider.impl`        | String of form `module.ClassName` 
 | null    | Optional, [custom 
`LocationProvider`](configuration.md#loading-a-custom-location-provider) 
implementation                                                                  
                                         |
+| `write.data.path`                        | String pointing to location       
 | ∅        | Sets the location where to write the data. If not set, it will 
use the table location postfixed with `data/`.                                  
                                                                     |

Review Comment:
   Also bringing attention to 
https://github.com/apache/iceberg-python/pull/1537#discussion_r1921100230 - it 
would be great to make the location providers docs consistent with 
`write.data.path` now.
   
   
https://github.com/apache/iceberg-python/blob/4a055d786ae997921f8ac0c8393f6ae7d4fe41bb/mkdocs/docs/configuration.md?plain=1#L213
   
   
https://github.com/apache/iceberg-python/blob/4a055d786ae997921f8ac0c8393f6ae7d4fe41bb/mkdocs/docs/configuration.md?plain=1#L242
   
   I think these are the only places that require changing - they'll be 
`write.data.path` (and maybe for clarity we should reiterate there that this 
defaults to `{location}/data` so examples are clearer). Could we do that in 
this PR maybe? (Happy to follow-up with this otherwise)



##########
pyiceberg/table/locations.py:
##########
@@ -28,6 +28,8 @@
 
 logger = logging.getLogger(__name__)
 
+WRITE_DATA_PATH = "write.data.path"

Review Comment:
   There's an argument for this, but I'd prefer this be in `TableProperties` 
with the other write table properties (it's what the Java implementation does 
too)



##########
mkdocs/docs/configuration.md:
##########
@@ -54,18 +54,19 @@ Iceberg tables support table properties to configure table 
behavior.
 
 ### Write options
 
-| Key                                      | Options                           
| Default | Description                                                         
                                                                                
                                                                |
-|------------------------------------------|-----------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `write.parquet.compression-codec`        | `{uncompressed,zstd,gzip,snappy}` 
| zstd    | Sets the Parquet compression coddec.                                
                                                                                
                                                                |
-| `write.parquet.compression-level`        | Integer                           
| null    | Parquet compression level for the codec. If not set, it is up to 
PyIceberg                                                                       
                                                                   |
-| `write.parquet.row-group-limit`          | Number of rows                    
| 1048576 | The upper bound of the number of entries within a single row group  
                                                                                
                                                                |
-| `write.parquet.page-size-bytes`          | Size in bytes                     
| 1MB     | Set a target threshold for the approximate encoded size of data 
pages within a column chunk                                                     
                                                                    |
-| `write.parquet.page-row-limit`           | Number of rows                    
| 20000   | Set a target threshold for the maximum number of rows within a 
column chunk                                                                    
                                                                     |
-| `write.parquet.dict-size-bytes`          | Size in bytes                     
| 2MB     | Set the dictionary page size limit per row group                    
                                                                                
                                                                |
-| `write.metadata.previous-versions-max`   | Integer                           
| 100     | The max number of previous version metadata files to keep before 
deleting after commit.                                                          
                                                                   |
-| `write.object-storage.enabled`           | Boolean                           
| True    | Enables the 
[`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider)
 that adds a hash component to file paths. Note: the default value of `True` 
differs from Iceberg's Java implementation |
-| `write.object-storage.partitioned-paths` | Boolean                           
| True    | Controls whether [partition values are included in file 
paths](configuration.md#partition-exclusion) when object storage is enabled     
                                                                            |
-| `write.py-location-provider.impl`        | String of form `module.ClassName` 
| null    | Optional, [custom 
`LocationProvider`](configuration.md#loading-a-custom-location-provider) 
implementation                                                                  
                                         |
+| Key                                      | Options                           
 | Default | Description                                                        
                                                                                
                                                                 |
+|------------------------------------------|------------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `write.parquet.compression-codec`        | `{uncompressed,zstd,gzip,snappy}` 
 | zstd    | Sets the Parquet compression coddec.                               
                                                                                
                                                                 |
+| `write.parquet.compression-level`        | Integer                           
 | null    | Parquet compression level for the codec. If not set, it is up to 
PyIceberg                                                                       
                                                                   |
+| `write.parquet.row-group-limit`          | Number of rows                    
 | 1048576 | The upper bound of the number of entries within a single row group 
                                                                                
                                                                 |
+| `write.parquet.page-size-bytes`          | Size in bytes                     
 | 1MB     | Set a target threshold for the approximate encoded size of data 
pages within a column chunk                                                     
                                                                    |
+| `write.parquet.page-row-limit`           | Number of rows                    
 | 20000   | Set a target threshold for the maximum number of rows within a 
column chunk                                                                    
                                                                     |
+| `write.parquet.dict-size-bytes`          | Size in bytes                     
 | 2MB     | Set the dictionary page size limit per row group                   
                                                                                
                                                                 |
+| `write.metadata.previous-versions-max`   | Integer                           
 | 100     | The max number of previous version metadata files to keep before 
deleting after commit.                                                          
                                                                   |
+| `write.object-storage.enabled`           | Boolean                           
 | True    | Enables the 
[`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider)
 that adds a hash component to file paths. Note: the default value of `True` 
differs from Iceberg's Java implementation |
+| `write.object-storage.partitioned-paths` | Boolean                           
 | True    | Controls whether [partition values are included in file 
paths](configuration.md#partition-exclusion) when object storage is enabled     
                                                                            |
+| `write.py-location-provider.impl`        | String of form `module.ClassName` 
 | null    | Optional, [custom 
`LocationProvider`](configuration.md#loading-a-custom-location-provider) 
implementation                                                                  
                                         |
+| `write.data.path`                        | String pointing to location       
 | ∅        | Sets the location where to write the data. If not set, it will 
use the table location postfixed with `data/`.                                  
                                                                     |

Review Comment:
   Also, what about 'table location + /data' as the default instead of ∅ like 
in the Java docs?



##########
mkdocs/docs/configuration.md:
##########
@@ -54,18 +54,19 @@ Iceberg tables support table properties to configure table 
behavior.
 
 ### Write options
 
-| Key                                      | Options                           
| Default | Description                                                         
                                                                                
                                                                |
-|------------------------------------------|-----------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| `write.parquet.compression-codec`        | `{uncompressed,zstd,gzip,snappy}` 
| zstd    | Sets the Parquet compression coddec.                                
                                                                                
                                                                |
-| `write.parquet.compression-level`        | Integer                           
| null    | Parquet compression level for the codec. If not set, it is up to 
PyIceberg                                                                       
                                                                   |
-| `write.parquet.row-group-limit`          | Number of rows                    
| 1048576 | The upper bound of the number of entries within a single row group  
                                                                                
                                                                |
-| `write.parquet.page-size-bytes`          | Size in bytes                     
| 1MB     | Set a target threshold for the approximate encoded size of data 
pages within a column chunk                                                     
                                                                    |
-| `write.parquet.page-row-limit`           | Number of rows                    
| 20000   | Set a target threshold for the maximum number of rows within a 
column chunk                                                                    
                                                                     |
-| `write.parquet.dict-size-bytes`          | Size in bytes                     
| 2MB     | Set the dictionary page size limit per row group                    
                                                                                
                                                                |
-| `write.metadata.previous-versions-max`   | Integer                           
| 100     | The max number of previous version metadata files to keep before 
deleting after commit.                                                          
                                                                   |
-| `write.object-storage.enabled`           | Boolean                           
| True    | Enables the 
[`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider)
 that adds a hash component to file paths. Note: the default value of `True` 
differs from Iceberg's Java implementation |
-| `write.object-storage.partitioned-paths` | Boolean                           
| True    | Controls whether [partition values are included in file 
paths](configuration.md#partition-exclusion) when object storage is enabled     
                                                                            |
-| `write.py-location-provider.impl`        | String of form `module.ClassName` 
| null    | Optional, [custom 
`LocationProvider`](configuration.md#loading-a-custom-location-provider) 
implementation                                                                  
                                         |
+| Key                                      | Options                           
 | Default | Description                                                        
                                                                                
                                                                 |
+|------------------------------------------|------------------------------------|---------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| `write.parquet.compression-codec`        | `{uncompressed,zstd,gzip,snappy}` 
 | zstd    | Sets the Parquet compression coddec.                               
                                                                                
                                                                 |
+| `write.parquet.compression-level`        | Integer                           
 | null    | Parquet compression level for the codec. If not set, it is up to 
PyIceberg                                                                       
                                                                   |
+| `write.parquet.row-group-limit`          | Number of rows                    
 | 1048576 | The upper bound of the number of entries within a single row group 
                                                                                
                                                                 |
+| `write.parquet.page-size-bytes`          | Size in bytes                     
 | 1MB     | Set a target threshold for the approximate encoded size of data 
pages within a column chunk                                                     
                                                                    |
+| `write.parquet.page-row-limit`           | Number of rows                    
 | 20000   | Set a target threshold for the maximum number of rows within a 
column chunk                                                                    
                                                                     |
+| `write.parquet.dict-size-bytes`          | Size in bytes                     
 | 2MB     | Set the dictionary page size limit per row group                   
                                                                                
                                                                 |
+| `write.metadata.previous-versions-max`   | Integer                           
 | 100     | The max number of previous version metadata files to keep before 
deleting after commit.                                                          
                                                                   |
+| `write.object-storage.enabled`           | Boolean                           
 | True    | Enables the 
[`ObjectStoreLocationProvider`](configuration.md#object-store-location-provider)
 that adds a hash component to file paths. Note: the default value of `True` 
differs from Iceberg's Java implementation |
+| `write.object-storage.partitioned-paths` | Boolean                           
 | True    | Controls whether [partition values are included in file 
paths](configuration.md#partition-exclusion) when object storage is enabled     
                                                                            |
+| `write.py-location-provider.impl`        | String of form `module.ClassName` 
 | null    | Optional, [custom 
`LocationProvider`](configuration.md#loading-a-custom-location-provider) 
implementation                                                                  
                                         |
+| `write.data.path`                        | String pointing to location       
 | ∅        | Sets the location where to write the data. If not set, it will 
use the table location postfixed with `data/`.                                  
                                                                     |

Review Comment:
   Maybe something like:
   
   ```suggestion
   | `write.data.path`                        | String referencing a location   
     | ∅        | Sets the location under which data is written. If not set, 
the table location (`table.metadata.location`) postfixed with `data/` is used.  
                                                                                
                     |
   ```



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [PR] Add support for `write.data.path` [iceberg-python]

Reply via email to