Fokko commented on issue #6475:
URL: https://github.com/apache/iceberg/issues/6475#issuecomment-1365733458
It looks like PyArrow is still doing more requests than s3fs.
I've created a local taxi table with Spark:
```sql
%%sql
CREATE DATABASE nyc;
%%sql
CREATE TABLE nyc.taxis (
    VendorID bigint,
    tpep_pickup_datetime timestamp,
    tpep_dropoff_datetime timestamp,
    passenger_count double,
    trip_distance double,
    RatecodeID double,
    store_and_fwd_flag string,
    PULocationID bigint,
    DOLocationID bigint,
    payment_type bigint,
    fare_amount double,
    extra double,
    mta_tax double,
    tip_amount double,
    tolls_amount double,
    improvement_surcharge double,
    total_amount double,
    congestion_surcharge double,
    airport_fee double
)
USING iceberg
PARTITIONED BY (days(tpep_pickup_datetime))
```
```python
%%python
# Load the files one at a time to avoid OOM; otherwise *.parquet would also
# work (and be more efficient)
for filename in [
    "yellow_tripdata_2022-04.parquet",
    "yellow_tripdata_2022-03.parquet",
    "yellow_tripdata_2022-02.parquet",
    "yellow_tripdata_2022-01.parquet",
    "yellow_tripdata_2021-12.parquet",
    "yellow_tripdata_2021-11.parquet",
    "yellow_tripdata_2021-10.parquet",
    "yellow_tripdata_2021-09.parquet",
    "yellow_tripdata_2021-08.parquet",
]:
    # `spark` is the session provided by the notebook
    df = spark.read.parquet(f"/home/iceberg/data/{filename}")
    df.write.mode("append").saveAsTable("nyc.taxis")
```
Looking at the MinIO requests when running `pyiceberg --catalog local files nyc.taxis`:
```
Snapshots: local.nyc.taxis
└── Snapshot 6682082212753545990, schema 0: s3a://warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
    ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
    │   ...
    ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
    │   ...
    ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
    │   ...
    ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
    │   ...
    ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
    │   ...
    ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
    │   ...
    ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
    │   ...
    ├── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
    │   ...
    └── Manifest: s3a://warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
        ...
```
### PyArrow
```
2022-12-27T08:45:32.822 [206 Partial Content] s3.GetObject
minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json
172.18.0.3 1.142ms ↑ 169 B ↓ 14 KiB
2022-12-27T08:45:32.913 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
172.18.0.1 867µs ↑ 153 B ↓ 412 B
2022-12-27T08:45:32.925 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
172.18.0.1 1.626ms ↑ 159 B ↓ 4.6 KiB
2022-12-27T08:45:32.973 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
172.18.0.1 1.216ms ↑ 153 B ↓ 413 B
2022-12-27T08:45:32.989 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
172.18.0.1 3.719ms ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.020 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
172.18.0.1 3.904ms ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.042 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
172.18.0.1 1.903ms ↑ 159 B ↓ 1.7 KiB
2022-12-27T08:45:33.104 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
172.18.0.1 1.232ms ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.113 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
172.18.0.1 683µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.120 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
172.18.0.1 975µs ↑ 159 B ↓ 7.0 KiB
2022-12-27T08:45:33.141 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
172.18.0.1 383µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.144 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
172.18.0.1 774µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.148 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
172.18.0.1 833µs ↑ 159 B ↓ 7.4 KiB
2022-12-27T08:45:33.170 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
172.18.0.1 432µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.173 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
172.18.0.1 1.208ms ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.178 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
172.18.0.1 814µs ↑ 159 B ↓ 8.2 KiB
2022-12-27T08:45:33.202 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
172.18.0.1 427µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.205 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
172.18.0.1 671µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.209 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
172.18.0.1 502µs ↑ 159 B ↓ 7.9 KiB
2022-12-27T08:45:33.233 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
172.18.0.1 616µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.236 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
172.18.0.1 955µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.240 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
172.18.0.1 934µs ↑ 159 B ↓ 7.4 KiB
2022-12-27T08:45:33.262 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
172.18.0.1 308µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.265 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
172.18.0.1 641µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.269 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
172.18.0.1 831µs ↑ 159 B ↓ 7.6 KiB
2022-12-27T08:45:33.295 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
172.18.0.1 625µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.298 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
172.18.0.1 828µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.302 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
172.18.0.1 897µs ↑ 159 B ↓ 7.8 KiB
2022-12-27T08:45:33.324 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
172.18.0.1 474µs ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.326 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
172.18.0.1 644µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.330 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
172.18.0.1 904µs ↑ 159 B ↓ 7.1 KiB
```
### S3FS
```
2022-12-27T09:05:13.127 [206 Partial Content] s3.GetObject
minio:9000/warehouse/wh/nyc/taxis/metadata/00009-3dd7ff4a-a81f-4780-bee1-de9335490c20.metadata.json
172.18.0.3 1.167ms ↑ 169 B ↓ 14 KiB
2022-12-27T09:05:13.245 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
172.18.0.1 4.86ms ↑ 138 B ↓ 412 B
2022-12-27T09:05:13.258 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/snap-6682082212753545990-1-9f900c62-59ca-4ea1-ae75-e373d113b036.avro
172.18.0.1 860µs ↑ 153 B ↓ 4.6 KiB
2022-12-27T09:05:13.265 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
172.18.0.1 282µs ↑ 138 B ↓ 413 B
2022-12-27T09:05:13.268 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/9f900c62-59ca-4ea1-ae75-e373d113b036-m0.avro
172.18.0.1 943µs ↑ 153 B ↓ 18 KiB
2022-12-27T09:05:13.297 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
172.18.0.1 472µs ↑ 138 B ↓ 413 B
2022-12-27T09:05:13.300 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
172.18.0.1 907µs ↑ 153 B ↓ 15 KiB
2022-12-27T09:05:13.320 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
172.18.0.1 349µs ↑ 138 B ↓ 413 B
2022-12-27T09:05:13.323 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/3fb110ad-0b08-40d7-8e02-ba86192cfc8d-m0.avro
172.18.0.1 779µs ↑ 153 B ↓ 15 KiB
2022-12-27T09:05:13.348 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
172.18.0.1 346µs ↑ 138 B ↓ 413 B
2022-12-27T09:05:13.351 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/6300efd8-fa3a-40ea-bc77-8c128e04d83d-m0.avro
172.18.0.1 681µs ↑ 153 B ↓ 16 KiB
2022-12-27T09:05:13.374 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
172.18.0.1 291µs ↑ 138 B ↓ 413 B
2022-12-27T09:05:13.377 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/27187379-514d-42f4-aada-ade2b239b41a-m0.avro
172.18.0.1 771µs ↑ 153 B ↓ 16 KiB
2022-12-27T09:05:13.399 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
172.18.0.1 375µs ↑ 138 B ↓ 413 B
2022-12-27T09:05:13.403 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/29b8cfd9-38a8-48ac-aabd-ede8b4118d3d-m0.avro
172.18.0.1 789µs ↑ 153 B ↓ 15 KiB
2022-12-27T09:05:13.431 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
172.18.0.1 373µs ↑ 138 B ↓ 413 B
2022-12-27T09:05:13.434 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/37f92724-8248-43c2-b8c7-cd40e2da21cf-m0.avro
172.18.0.1 650µs ↑ 153 B ↓ 16 KiB
2022-12-27T09:05:13.457 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
172.18.0.1 308µs ↑ 138 B ↓ 413 B
2022-12-27T09:05:13.460 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/02096f9b-1056-43ed-85af-871a47ad87f8-m0.avro
172.18.0.1 753µs ↑ 153 B ↓ 16 KiB
2022-12-27T09:05:13.482 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
172.18.0.1 279µs ↑ 138 B ↓ 413 B
2022-12-27T09:05:13.488 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/5a44ffb0-64ac-4b4d-9896-a53c52d6c30b-m0.avro
172.18.0.1 885µs ↑ 153 B ↓ 15 KiB
```
We can observe that PyArrow makes two `GetObject` calls per manifest Avro file (three for the first manifest), whereas s3fs makes just one. Both issue a single `HeadObject` per file.
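To avoid counting by eye, the traces can be tallied with a few lines of Python. This is only a rough sketch: `ENTRY` and `count_requests` are hypothetical helper names, and the regex assumes the trace layout shown in the logs above (status in brackets, then `s3.<Operation>`, then the object URL). The two sample traces are the entries for the second manifest, copied from the logs above.

```python
import re
from collections import Counter

# Assumed layout of one trace entry once the wrapped lines are rejoined:
#   <timestamp> [<status> <reason>] s3.<Operation> <host>/<bucket>/<key> ...
ENTRY = re.compile(r"\[(?P<status>\d{3}) [^\]]*\] s3\.(?P<op>\w+) (?P<url>\S+)")


def count_requests(trace: str) -> Counter:
    """Count requests per (operation, object file name) in trace output."""
    # Entries may be wrapped over several lines, so collapse all whitespace.
    flat = " ".join(trace.split())
    counts = Counter()
    for match in ENTRY.finditer(flat):
        name = match.group("url").rsplit("/", 1)[-1]
        counts[(match.group("op"), name)] += 1
    return counts


PYARROW_TRACE = """
2022-12-27T08:45:33.104 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
172.18.0.1 1.232ms ↑ 153 B ↓ 413 B
2022-12-27T08:45:33.113 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
172.18.0.1 683µs ↑ 159 B ↓ 8.5 KiB
2022-12-27T08:45:33.120 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
172.18.0.1 975µs ↑ 159 B ↓ 7.0 KiB
"""

S3FS_TRACE = """
2022-12-27T09:05:13.297 [200 OK] s3.HeadObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
172.18.0.1 472µs ↑ 138 B ↓ 413 B
2022-12-27T09:05:13.300 [206 Partial Content] s3.GetObject
127.0.0.1:9000/warehouse/wh/nyc/taxis/metadata/f4d30f09-ebf1-4339-ae9f-652739d4e0bc-m0.avro
172.18.0.1 907µs ↑ 153 B ↓ 15 KiB
"""

for label, trace in [("PyArrow", PYARROW_TRACE), ("s3fs", S3FS_TRACE)]:
    for (op, name), n in sorted(count_requests(trace).items()):
        print(f"{label:8} {op:11} {n}  {name}")
```

Feeding the full traces through the same function should give a per-file breakdown that is easy to diff between the two FileIO implementations.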