DieHertz commented on issue #1229:
URL:
https://github.com/apache/iceberg-python/issues/1229#issuecomment-2428451067
So I haven't tried any actual changes yet, but decided to collect some
baseline measurements with py-spy.
First, a baseline: pyiceberg 0.7.1's `.inspect.files()` on my big phat table:
```python
In [3]: start = time.time()
...: f = m.inspect.files()
...: print('elapsed', time.time() - start)
elapsed 110.91524386405945
In [4]: len(f)
Out[4]: 401188
In [5]: len(m.current_snapshot().manifests(m.io))
Out[5]: 688
```

The py-spy profile shows that most of the time is spent in the `AvroFile`
constructor, where some initial decoding occurs, and inside the list
comprehension over manifest entries, where the Avro records are transformed
into dicts.
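To illustrate why the per-record dict conversion shows up as pure CPU time, here is a minimal, self-contained sketch. It is not pyiceberg's actual decoder; the field names and record tuples are fabricated stand-ins for decoded Avro manifest entries:

```python
import time

# Hypothetical stand-in for decoded Avro manifest entries: the real records
# come from pyiceberg's Avro reader; here we fabricate plain tuples.
FIELDS = ("status", "snapshot_id", "file_path", "file_size_in_bytes")
records = [(1, 42, f"s3://bucket/data/{i}.parquet", 1024 * i) for i in range(400_000)]

start = time.time()
# The per-record transformation: each tuple becomes a dict, which is pure
# CPU work (allocation plus hashing), with no IO involved at all.
dicts = [dict(zip(FIELDS, rec)) for rec in records]
elapsed = time.time() - start

print(f"converted {len(dicts)} records in {elapsed:.2f}s")
```

At ~400k entries (the size of the table above), even this trivial conversion takes a measurable chunk of time, and the real code does schema-aware decoding per record on top of it.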
I argue that this is CPU-bound rather than IO-bound, and to prove that
conclusively I ran the same code with a quickly-crafted memoized
Snapshot/Manifest:
```python
import pyiceberg.table
from concurrent.futures import ThreadPoolExecutor

class IOFromBytes:
    """Minimal FileIO stand-in that serves a pre-downloaded byte buffer."""
    def __init__(self, bytes_: bytes):
        self._bytes = bytes_

    def open(self):
        return self

    def __enter__(self):
        return self

    def __exit__(self, a, b, c):
        ...

    def read(self):
        return self._bytes

    def new_input(self, *args, **kwargs):
        return self

class MemoryManifest:
    """Downloads a manifest file once and replays it from memory."""
    def __init__(self, manifest, io):
        self._manifest = manifest
        with io.new_input(manifest.manifest_path).open() as f:
            self._io = IOFromBytes(f.read())

    def fetch_manifest_entry(self, *args, **kwargs):
        return self._manifest.fetch_manifest_entry(self._io, **kwargs)

class MemorySnapshot:
    """Pre-fetches all manifests of the current snapshot in parallel."""
    def __init__(self, table: pyiceberg.table.Table):
        with ThreadPoolExecutor() as pool:
            self._manifests = list(pool.map(
                lambda manifest: MemoryManifest(manifest, table.io),
                table.current_snapshot().manifests(table.io),
            ))

    def manifests(self, *args, **kwargs):
        return self._manifests
```
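As a quick sanity check, the `IOFromBytes` shim can be exercised on its own; it simply hands back the captured bytes regardless of the path requested (the class is repeated here so the snippet runs standalone):

```python
class IOFromBytes:
    """Minimal FileIO stand-in that serves a pre-downloaded byte buffer."""
    def __init__(self, bytes_: bytes):
        self._bytes = bytes_

    def open(self):
        return self

    def __enter__(self):
        return self

    def __exit__(self, a, b, c):
        ...

    def read(self):
        return self._bytes

    def new_input(self, *args, **kwargs):
        return self

io = IOFromBytes(b"avro-manifest-bytes")
# The path argument is ignored; every open() replays the same buffer.
with io.new_input("s3://ignored/path").open() as f:
    data = f.read()
print(data)  # → b'avro-manifest-bytes'
```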
Now we can see the actual IO takes less than 1 second for a total of ~112
MiB (without the `ThreadPoolExecutor`, it was closer to 11 seconds):
```python
In [35]: start = time.time()
...: snapshot = MemorySnapshot(m)
...: print('elapsed', time.time() - start)
elapsed 0.4690868854522705
In [37]: len(snapshot._manifests)
Out[37]: 688
In [39]: sum(len(manifest._io._bytes) for manifest in snapshot._manifests) / 1024 / 1024
Out[39]: 112.21551609039307
```
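The large gap between the threaded and sequential download times is typical of latency-bound IO. A minimal simulation, with `time.sleep` standing in for object-store round-trips (the manifest paths and latency are made up), shows the same shape:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_fetch(manifest_path: str) -> bytes:
    # Stand-in for an object-store GET: ~20 ms of latency, tiny payload.
    time.sleep(0.02)
    return manifest_path.encode()

paths = [f"s3://bucket/manifest-{i}.avro" for i in range(32)]

start = time.time()
sequential = [fake_fetch(p) for p in paths]
t_seq = time.time() - start

start = time.time()
with ThreadPoolExecutor(max_workers=16) as pool:
    threaded = list(pool.map(fake_fetch, paths))
t_thr = time.time() - start

print(f"sequential: {t_seq:.2f}s, threaded: {t_thr:.2f}s")
assert sequential == threaded  # same results, far less wall time
```

Because the workers spend their time blocked on (simulated) network latency, the GIL is not a bottleneck here, which is exactly why the download phase parallelizes so well while the decode phase would not.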
Now `.inspect.files()` over already downloaded data:
```python
In [36]: start = time.time()
    ...: m.inspect._get_snapshot = lambda self_: snapshot
...: f = m.inspect.files()
...: print('elapsed', time.time() - start)
elapsed 97.30642795562744
In [38]: len(f)
Out[38]: 401188
```

So IO accounts for a little more than 10% of the total time taken by
`.inspect.files()`, and that is roughly the ceiling on the improvement we can
expect from using just the `ThreadPoolExecutor`.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]