sjvanrossum commented on code in PR #34135:
URL: https://github.com/apache/beam/pull/34135#discussion_r1979513786
##########
sdks/python/apache_beam/io/gcp/bigquery_tools.py:
##########
@@ -351,6 +352,9 @@ class BigQueryWrapper(object):
TEMP_DATASET = 'beam_temp_dataset_'
HISTOGRAM_METRIC_LOGGER = MetricLogger()
+
+ # Default TTL for cached table definitions in seconds
+ DEFAULT_TABLE_DEFINITION_TTL = 3600 # 1 hour
Review Comment:
If I'm not mistaken, BigQueryIO's caches in the Java SDK use TTL=1s.
TTL=1h is not suitable for streaming pipelines, since table updates could
take up to an hour to propagate.
Considering the other comments about cachetools and cache lifetimes, could
you leave a few suggestions to improve this?
##########
sdks/python/apache_beam/io/gcp/bigquery_tools.py:
##########
@@ -386,8 +390,28 @@ def __init__(self, client=None, temp_dataset_id=None,
temp_table_ref=None):
self._temporary_table_suffix = uuid.uuid4().hex
self.temp_dataset_id = temp_dataset_id or self._get_temp_dataset()
+ # Initialize table definition cache with default TTL of 1 hour
+ # Cache entries are invalidated after TTL expires to ensure fresh metadata
+ self._table_cache = {}
Review Comment:
This cache container is scoped per instance and will be instantiated once
per harness thread, resulting in many individual caches.
If I'm not mistaken, this also means the locks add no value right now.
The suggested alternatives (functools, cachetools) provide thread-safe
implementations and should be used instead.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]