Fokko commented on code in PR #311:
URL: https://github.com/apache/iceberg-python/pull/311#discussion_r1469257727
##########
mkdocs/docs/index.md:
##########
@@ -38,36 +38,129 @@ You can install the latest release version from pypi:
pip install "pyiceberg[s3fs,hive]"
```
-Install it directly for GitHub (not recommended), but sometimes handy:
+You can mix and match optional dependencies depending on your needs:
+
+| Key          | Description                                                           |
+| ------------ | --------------------------------------------------------------------- |
+| hive         | Support for the Hive metastore                                        |
+| glue         | Support for AWS Glue                                                  |
+| dynamodb     | Support for AWS DynamoDB                                              |
+| sql-postgres | Support for SQL Catalog backed by PostgreSQL                          |
+| sql-sqlite   | Support for SQL Catalog backed by SQLite                              |
+| pyarrow      | PyArrow as a FileIO implementation to interact with the object store  |
+| pandas       | Installs both PyArrow and Pandas                                      |
+| duckdb       | Installs both PyArrow and DuckDB                                      |
+| ray          | Installs PyArrow, Pandas, and Ray                                     |
+| s3fs         | S3FS as a FileIO implementation to interact with the object store     |
+| adlfs        | ADLFS as a FileIO implementation to interact with the object store    |
+| snappy       | Support for snappy Avro compression                                   |
+| gcs          | GCS as a FileIO implementation to interact with the object store      |
+
+You need to install either `s3fs`, `adlfs`, `gcs`, or `pyarrow` to be able to fetch files from an object store.
+
+## Connecting to a catalog
+
+Iceberg leverages the [catalog to have one centralized place to organize the tables](https://iceberg.apache.org/catalog/). This can be a traditional Hive catalog to store your Iceberg tables next to the rest, a vendor solution like the AWS Glue catalog, or an implementation of Iceberg's own [REST protocol](https://github.com/apache/iceberg/tree/main/open-api). Check out the [configuration](configuration.md) page to find all the configuration details.
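+
+As a minimal sketch of loading a configured catalog (the catalog name and the `uri` property below are placeholders; properties can also come from a `~/.pyiceberg.yaml` file, and the configuration page lists the full set of options):
+
+```python
+from pyiceberg.catalog import load_catalog
+
+# "default" refers to a catalog defined in your configuration; the URI here
+# is a hypothetical REST catalog endpoint, shown only for illustration.
+catalog = load_catalog(
+    "default",
+    **{"uri": "http://localhost:8181"},
+)
+```
+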
+## Write a PyArrow dataframe
+
+Let's take the Taxi dataset and write it to an Iceberg table.
+
+First download one month of data:
+
+```shell
+curl https://d37ci6vzurychx.cloudfront.net/trip-data/yellow_tripdata_2023-01.parquet -o /tmp/yellow_tripdata_2023-01.parquet
+```
-pip install "git+https://github.com/apache/iceberg-python.git#egg=pyiceberg[s3fs]"
+
+Load it into your PyArrow dataframe:
+
+```python
+import pyarrow.parquet as pq
+
+df = pq.read_table("/tmp/yellow_tripdata_2023-01.parquet")
+```
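+
+As a quick sanity check on the load (a brief sketch; `num_rows` and `schema` are standard attributes of a PyArrow `Table`):
+
+```python
+print(df.num_rows)  # row count of the downloaded month
+print(df.schema)    # the Arrow schema read from the Parquet footer
+```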
-Or clone the repository for local development:
+Create a new Iceberg table:
-```sh
-git clone https://github.com/apache/iceberg-python.git
-cd iceberg-python
-pip3 install -e ".[s3fs,hive]"
+```python
+from pyiceberg.catalog import load_catalog
+
+catalog = load_catalog("default")
+
+table = catalog.create_table(
+ "default.taxi_dataset",
+    schema=df.schema,  # Blocked by https://github.com/apache/iceberg-python/pull/305
+)
+```
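+
+If the table already exists, you can load it instead of creating it (a small sketch using the catalog API; the identifier matches the one used above):
+
+```python
+# load_table returns a Table object for an existing identifier
+table = catalog.load_table("default.taxi_dataset")
+```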
-You can mix and match optional dependencies depending on your needs:
+Append the dataframe to the table:
+
+```python
+table.append(df)
+len(table.scan().to_arrow())
+```
+
+3,066,766 rows have been written to the table.
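+
+You can also read back a subset instead of materializing the whole table (a minimal sketch; the predicate and column list are just examples):
+
+```python
+# Scan with a row filter and a column projection
+filtered = table.scan(
+    row_filter="tip_amount > 10",
+    selected_fields=("tip_amount", "trip_distance"),
+).to_arrow()
+```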
+
+Now generate a tip-per-mile feature to train the model on:
+
+```python
+import pyarrow.compute as pc
+
+df = df.append_column("tip_per_mile", pc.divide(df["tip_amount"], df["trip_distance"]))
+```
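+
+A quick look at the derived column (a sketch using PyArrow compute; `mean` is a standard aggregation in `pyarrow.compute`):
+
+```python
+# Average tip per mile across the month
+print(pc.mean(df["tip_per_mile"]))
+```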
+
+Evolve the schema of the table with the new column:
+
+```python
+from pyiceberg.catalog import Catalog
+
+with table.update_schema() as update_schema:
+ # Blocked by https://github.com/apache/iceberg-python/pull/305
+ update_schema.union_by_name(Catalog._convert_schema_if_needed(df.schema))
Review Comment:
I fully agree. I should have been more explicit in the comment above. Once
#305 is in, we can update the `union_by_name` to also accept `pa.Schema`. WDYT?
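
For illustration, a hypothetical sketch of what that could look like once #305 lands (assuming `union_by_name` learns to accept a `pa.Schema` directly):

```python
with table.update_schema() as update_schema:
    # Hypothetical post-#305 API: pass the PyArrow schema directly,
    # without the private Catalog._convert_schema_if_needed helper.
    update_schema.union_by_name(df.schema)
```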
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]