geruh commented on code in PR #2993:
URL: https://github.com/apache/iceberg-python/pull/2993#discussion_r2933243353
##########
pyiceberg/manifest.py:
##########
@@ -894,10 +895,27 @@ def __hash__(self) -> int:
# Global cache for ManifestFile objects, keyed by manifest_path.
# This deduplicates ManifestFile objects across manifest lists, which commonly
# share manifests after append operations.
-_manifest_cache: LRUCache[str, ManifestFile] = LRUCache(maxsize=128)
+_DEFAULT_MANIFEST_CACHE_SIZE = 128
+_configured_manifest_cache_size = Config().get_int("manifest-cache-size")
+_manifest_cache_size = (
+ _configured_manifest_cache_size if _configured_manifest_cache_size is not
None else _DEFAULT_MANIFEST_CACHE_SIZE
+)
-# Lock for thread-safe cache access
+# Lock for thread-safe cache access.
_manifest_cache_lock = threading.RLock()
+_manifest_cache: LRUCache[str, ManifestFile] | dict[str, ManifestFile] = (
Review Comment:
Do we even need this map | lru complexity if you're returning early below?
##########
mkdocs/docs/configuration.md:
##########
@@ -48,6 +48,21 @@ The environment variable picked up by Iceberg starts with
`PYICEBERG_` and then
For example, `PYICEBERG_CATALOG__DEFAULT__S3__ACCESS_KEY_ID`, sets
`s3.access-key-id` on the `default` catalog.
+## Manifest Caching
+
+PyIceberg caches `ManifestFile` objects locally and uses an LRU policy to
bound the cache size. By default, up to `128`
+manifest entries are retained.
+
+You can change the cache size with the `PYICEBERG_MANIFEST_CACHE_SIZE`
environment variable:
Review Comment:
I think it'd be helpful to frame this more as a config and configs are
tunable by the environment variable or the `.pyiceberg.yaml`. I guess this kind
of just makes it seem like you can only tune this number with an environment
variable.
Going off of what you have below:
`Config().get_int("manifest-cache-size")`
##########
pyiceberg/manifest.py:
##########
@@ -927,6 +945,9 @@ def _manifests(io: FileIO, manifest_list: str) ->
tuple[ManifestFile, ...]:
file = io.new_input(manifest_list)
manifest_files = list(read_manifest_list(file))
+ if _manifest_cache_size == 0:
Review Comment:
We will see a cahce failure if someone incidentally sets cache to negative
value
##########
pyiceberg/manifest.py:
##########
@@ -894,10 +895,27 @@ def __hash__(self) -> int:
# Global cache for ManifestFile objects, keyed by manifest_path.
# This deduplicates ManifestFile objects across manifest lists, which commonly
# share manifests after append operations.
-_manifest_cache: LRUCache[str, ManifestFile] = LRUCache(maxsize=128)
+_DEFAULT_MANIFEST_CACHE_SIZE = 128
+_configured_manifest_cache_size = Config().get_int("manifest-cache-size")
+_manifest_cache_size = (
+ _configured_manifest_cache_size if _configured_manifest_cache_size is not
None else _DEFAULT_MANIFEST_CACHE_SIZE
+)
-# Lock for thread-safe cache access
+# Lock for thread-safe cache access.
Review Comment:
nit: on the cosmetic comment change here
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]