omkenge commented on issue #1200:
URL:
https://github.com/apache/iceberg-python/issues/1200#issuecomment-2640451331
Hello @Fokko,
Here is a small implementation.
1. List Data Files in S3
We use PyArrow’s S3FileSystem to retrieve file paths from the given table
location:
```
from pyarrow import fs


def list_data_files_from_table(table_location: str) -> set:
    if not table_location.startswith("s3://"):
        raise ValueError("Table location must start with 's3://'")
    base = table_location.rstrip("/")
    # Look under the table's data/ directory.
    data_location = base if base.endswith("/data") else f"{base}/data"
    # Connection settings for a local S3-compatible endpoint.
    s3 = fs.S3FileSystem(
        region="eu-central-1",
        endpoint_override="127.0.0.1:9000",
        access_key="admin",
        secret_key="password",
        scheme="http",
    )
    # PyArrow expects "bucket/key" paths without the "s3://" scheme.
    bucket, prefix = data_location[5:].split("/", 1)
    selector = fs.FileSelector(f"{bucket}/{prefix}", recursive=True)
    file_infos = s3.get_file_info(selector)
    # Re-add the scheme so the paths can be compared with the metadata.
    return {
        f"s3://{info.path}"
        for info in file_infos
        if info.type == fs.FileType.File
    }
```
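The bucket/prefix split above relies on slicing off the first five characters. As a rough alternative sketch (not part of the implementation above), the same split can be done with the standard library. `split_s3_uri` is a hypothetical helper name, and the example URI is made up:

```python
from urllib.parse import urlparse


def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/prefix' into ('bucket', 'some/prefix')."""
    parsed = urlparse(uri)
    if parsed.scheme != "s3" or not parsed.netloc:
        raise ValueError(f"Not a valid s3:// URI: {uri!r}")
    # The parsed path starts with '/', which S3 keys do not include.
    return parsed.netloc, parsed.path.strip("/")


# Hypothetical table location, for illustration only.
bucket, prefix = split_s3_uri("s3://warehouse/db/events/data/")
print(bucket, prefix)
```

One caveat with this approach: `urlparse` treats `?` and `#` as query and fragment separators, so keys containing those characters would need different handling.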
2. Extract Metadata-Tracked Files
Using PyIceberg, we retrieve file paths stored in the table metadata:
```
def extract_metadata_files(table) -> set:
    metadata_table = table.inspect.files()
    return set(metadata_table.column("file_path").to_pylist())
```
3. Identify Orphan Files
```
def find_orphan_files(table_location, table):
    s3_files = list_data_files_from_table(table_location)
    metadata_files = extract_metadata_files(table)
    # Files in S3 but not in metadata
    orphan_files = s3_files - metadata_files
    return orphan_files
```
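The set difference only works if both sides spell the same object the same way. A minimal sketch of that comparison on made-up paths, assuming the metadata may record an `s3a://` or `s3n://` scheme while the listing always yields `s3://`. `normalize` and `diff_orphans` are hypothetical helpers, not part of PyIceberg:

```python
def normalize(path: str) -> str:
    """Map s3a:// and s3n:// onto s3:// so equal objects compare equal."""
    for scheme in ("s3a://", "s3n://"):
        if path.startswith(scheme):
            return "s3://" + path[len(scheme):]
    return path


def diff_orphans(storage_files: set, metadata_files: set) -> set:
    """Return files present in storage but absent from the metadata."""
    tracked = {normalize(p) for p in metadata_files}
    return {p for p in storage_files if normalize(p) not in tracked}


# Made-up paths, for illustration only.
storage = {
    "s3://warehouse/db/events/data/a.parquet",
    "s3://warehouse/db/events/data/b.parquet",
}
metadata = {"s3a://warehouse/db/events/data/a.parquet"}
print(diff_orphans(storage, metadata))
# {'s3://warehouse/db/events/data/b.parquet'}
```

Without the normalization step, `a.parquet` would be reported as an orphan even though the metadata still tracks it.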