geruh commented on code in PR #2993:
URL: https://github.com/apache/iceberg-python/pull/2993#discussion_r2837096276


##########
pyiceberg/manifest.py:
##########
@@ -891,15 +892,32 @@ def __hash__(self) -> int:
         return hash(self.manifest_path)
 
 
-# Global cache for ManifestFile objects, keyed by manifest_path.
-# This deduplicates ManifestFile objects across manifest lists, which commonly
-# share manifests after append operations.
-_manifest_cache: LRUCache[str, ManifestFile] = LRUCache(maxsize=128)
-
-# Lock for thread-safe cache access
+_DEFAULT_MANIFEST_CACHE_SIZE = 128
 _manifest_cache_lock = threading.RLock()
 
 
+def _init_manifest_cache() -> LRUCache[str, ManifestFile] | None:

Review Comment:
   I think we can still simplify this to:
   
   ```python
   _manifest_cache_size = Config().get_int("manifest-cache-size") or _DEFAULT_MANIFEST_CACHE_SIZE
   _manifest_cache: LRUCache[str, ManifestFile] = LRUCache(maxsize=_manifest_cache_size)
   ```
   
   If someone really wants to disable caching, all they have to do is set this 
value to something like 1. The overhead here is pretty negligible, unless there 
is a strong opinion for a case in which we never allocate an LRUCache instance...
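
One property of the `or` fallback suggested above may be worth keeping in mind. The following is only a minimal sketch: `resolve_cache_size` is a hypothetical helper, and its argument stands in for the result of `Config().get_int("manifest-cache-size")`, which is assumed here to return `None` when the key is unset. Because `0` is falsy in Python, a configured value of 0 would fall back to the default rather than produce an empty cache, which fits the suggestion of using 1 as the smallest practical setting.

```python
from typing import Optional

_DEFAULT_MANIFEST_CACHE_SIZE = 128


def resolve_cache_size(configured: Optional[int]) -> int:
    """Hypothetical helper mirroring the `value or default` expression.

    `configured` stands in for Config().get_int("manifest-cache-size"),
    assumed to be None when the key is not set.
    """
    # `or` falls back on any falsy value, so both None and 0 yield the default.
    return configured or _DEFAULT_MANIFEST_CACHE_SIZE


unset_size = resolve_cache_size(None)  # key not configured
minimal_size = resolve_cache_size(1)   # smallest value that survives the fallback
zero_size = resolve_cache_size(0)      # 0 is falsy, so the default is used
```

If a setting of 0 ever needs to mean "no caching", an explicit `is None` check would be needed instead of `or`.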



##########
tests/utils/test_manifest.py:
##########
@@ -46,9 +48,10 @@
 
 
 @pytest.fixture(autouse=True)
-def clear_global_manifests_cache() -> None:
-    # Clear the global cache before each test
-    _manifest_cache.clear()
+def reset_global_manifests_cache() -> None:
+    with manifest_module._manifest_cache_lock:
+        manifest_module._manifest_cache = manifest_module._init_manifest_cache()
+    clear_manifest_cache()

Review Comment:
   Depending on the init functionality, we can probably remove the clear, since 
it's a no-op right after your current init function runs.



##########
pyiceberg/manifest.py:
##########
@@ -927,14 +945,18 @@ def _manifests(io: FileIO, manifest_list: str) -> tuple[ManifestFile, ...]:
     file = io.new_input(manifest_list)
     manifest_files = list(read_manifest_list(file))
 
+    if _manifest_cache is None:
+        return tuple(manifest_files)
+
     result = []
     with _manifest_cache_lock:
+        cache = _manifest_cache

Review Comment:
   nit: do we need this variable here? It adds a bit of noise to the change.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

