Re: [PR] [Python] Add caching for BigQuery table definitions [beam]

via GitHub Tue, 04 Mar 2025 06:24:30 -0800


Strikerrx01 commented on code in PR #34135:
URL: https://github.com/apache/beam/pull/34135#discussion_r1979567570



##########
sdks/python/apache_beam/io/gcp/bigquery_tools.py:
##########
@@ -386,8 +390,28 @@ def __init__(self, client=None, temp_dataset_id=None, 
temp_table_ref=None):
       self._temporary_table_suffix = uuid.uuid4().hex
       self.temp_dataset_id = temp_dataset_id or self._get_temp_dataset()
 
+    # Initialize table definition cache with default TTL of 1 hour
+    # Cache entries are invalidated after TTL expires to ensure fresh metadata
+    self._table_cache = {}

Review Comment:
   @sjvanrossum Thanks for catching this important issue about cache scope and 
thread safety. You're absolutely right that:
   
   1. The current per-instance cache with harness thread instantiation could 
lead to multiple caches
   2. The locks aren't providing much value in the current implementation
   3. We should use thread-safe implementations from functools/cachetools
   
   I'll modify the implementation to:
   1. Use `cachetools.TTLCache` which provides thread-safe caching out of the 
box
   2. Remove the redundant lock mechanism since cachetools handles thread safety
   3. Ensure the cache is properly shared across harness threads
   
   Would you like me to show you the updated implementation using cachetools 
before I commit the changes?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] [Python] Add caching for BigQuery table definitions [beam]

Reply via email to