jkleinkauff commented on issue #1032:
URL: https://github.com/apache/iceberg-python/issues/1032#issuecomment-2278564297
Hey, thank you for taking the time to answer me!
1. My files are in S3.
2. Sure! Is that something I could do on my end? Do you have any recommendations on how?
(I'm not sure if it's the same thing, but running a download profiler I wrote with psutil on one file takes roughly 25s to complete; a simplified sketch of that timing check is right below.)
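For context, the timing check is roughly this minimal sketch (the real profiler also samples psutil metrics; the bucket/key below are placeholders):
```python
# Minimal sketch of the per-file download timing (placeholder bucket/key;
# the real profiler also samples psutil metrics).
import time

import boto3

s3 = boto3.client("s3")

start = time.perf_counter()
s3.download_file(
    "data-lake-jho",
    "bronze/curitiba_starts_june/data/example.parquet",  # placeholder key
    "/tmp/example.parquet",
)
print(f"download took {time.perf_counter() - start:.1f}s")
```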
Yeah, even with limit=1 the scan still seems to return both files (just an observation, maybe that's intended):
```python
df = table.scan(limit=1)
# pa_table = df.to_arrow()
[print(task.file.file_path) for task in df.plan_files()]
# s3://xxx/xxx/curitiba_starts_june/data/00000-0-6984da88-fe64-4765-9137-739072becfb1.parquet
# s3://xxx/xxx/curitiba_starts_june/data/00000-0-1de29b8f-2e8c-4543-9663-f769d53b17b7.parquet
```
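If I understand it correctly (happy to be corrected), limit is applied when the rows are materialized rather than during file planning, so plan_files() can still list every matching file. A small sketch of what I mean:
```python
# Hedged sketch: my understanding is that limit trims rows at read time,
# while plan_files() still lists every file that matches the scan.
scan = table.scan(limit=1)

for task in scan.plan_files():
    # file_path and file_size_in_bytes come from the DataFile metadata
    print(task.file.file_path, task.file.file_size_in_bytes)

pa_table = scan.to_arrow()
print(len(pa_table))  # I'd expect 1 row here because of limit=1
```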
Output of `table.inspect.manifests().to_pandas()`:
```python
❯ python pyiceberg_duckdb.py
   content                                               path  length  partition_spec_id  ...  added_delete_files_count  existing_delete_files_count  deleted_delete_files_count  partition_summaries
0        0  s3://data-lake-jho/bronze/curitiba_starts_june...   10433                  0  ...                         0                            0                           0                   []
1        0  s3://data-lake-jho/bronze/curitiba_starts_june...   10430                  0  ...                         0                            0                           0                   []

[2 rows x 12 columns]
```
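If a narrower view helps, the same frame can be trimmed to a few columns (a quick sketch, using only column names that appear in the output above):
```python
# Trim the manifests frame to a few columns for readability
# (column names taken from the 12-column output above).
manifests = table.inspect.manifests().to_pandas()
print(manifests[["path", "length", "partition_spec_id", "partition_summaries"]])
```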
I can also share the files themselves, or a direct link to them. Thank you!