sjvanrossum commented on code in PR #34135:
URL: https://github.com/apache/beam/pull/34135#discussion_r1979513786
##########
sdks/python/apache_beam/io/gcp/bigquery_tools.py:
##########
@@ -351,6 +352,9 @@ class BigQueryWrapper(object):
TEMP_DATASET = 'beam_temp_dataset_'
HISTOGRAM_METRIC_LOGGER = MetricLogger()
+
+ # Default TTL for cached table definitions in seconds
+ DEFAULT_TABLE_DEFINITION_TTL = 3600 # 1 hour
Review Comment:
If I'm not mistaken, BigQueryIO's caches in the Java SDK use TTL=1s.
TTL=1h is not suitable for streaming pipelines, since table updates could
take up to an hour to propagate.
Considering the other comments about cachetools and cache lifetimes, could
you leave a few suggestions to improve this?
##########
sdks/python/apache_beam/io/gcp/bigquery_tools.py:
##########
@@ -386,8 +390,28 @@ def __init__(self, client=None, temp_dataset_id=None,
temp_table_ref=None):
self._temporary_table_suffix = uuid.uuid4().hex
self.temp_dataset_id = temp_dataset_id or self._get_temp_dataset()
+ # Initialize table definition cache with default TTL of 1 hour
+ # Cache entries are invalidated after TTL expires to ensure fresh metadata
+ self._table_cache = {}
Review Comment:
This cache container is scoped per instance and will be instantiated once
per harness thread, resulting in many individual caches.
If I'm not mistaken, this also means the locks add no value right now.
The suggested alternatives (functools, cachetools) provide thread-safe
implementations and should be used instead.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]