korbit-ai[bot] commented on code in PR #35923: URL: https://github.com/apache/superset/pull/35923#discussion_r2480334063
########## superset/utils/webdriver.py: ########## @@ -243,7 +246,30 @@ def get_screenshot( # pylint: disable=too-many-locals, too-many-statements # n class WebDriverSelenium(WebDriverProxy): - def create(self) -> WebDriver: + def __init__( + self, + driver_type: str, + window: WindowSize | None = None, + user: User | None = None, + ): + super().__init__(driver_type, window) + self._user = user + self._driver = None Review Comment: ### WebDriver instance persistence causes resource leaks <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? WebDriverSelenium now stores a WebDriver instance as an instance variable, creating a persistent connection that may not be properly cleaned up if the destructor fails. ###### Why this matters This creates resource leaks when WebDriver instances are not properly closed, leading to accumulating browser processes and memory consumption over time, especially in long-running applications or high-frequency screenshot operations. ###### Suggested change ∙ *Feature Preview* Implement a context manager pattern or ensure explicit cleanup methods are called. Consider using a factory pattern that creates and destroys drivers per operation rather than persisting them: ```python class WebDriverSelenium(WebDriverProxy): def get_screenshot(self, url: str, element_name: str, user: User | None = None) -> bytes | None: driver = None try: driver = self._create() driver.set_window_size(*self._window) if user: self._auth(user, driver) # ... screenshot logic finally: if driver: self._destroy(driver) ``` ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/f6a697a2-78e8-440d-a6cd-d7ad773144ac/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/f6a697a2-78e8-440d-a6cd-d7ad773144ac?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/f6a697a2-78e8-440d-a6cd-d7ad773144ac?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/f6a697a2-78e8-440d-a6cd-d7ad773144ac?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/f6a697a2-78e8-440d-a6cd-d7ad773144ac) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:1d34c595-2008-4522-a672-f904725371ba --> [](1d34c595-2008-4522-a672-f904725371ba) ########## superset/config.py: ########## @@ -837,6 +837,9 @@ class D3TimeFormat(TypedDict, total=False): } THUMBNAIL_ERROR_CACHE_TTL = int(timedelta(days=1).total_seconds()) +# Cache warmup user +SUPERSET_CACHE_WARMUP_USER = "admin" Review Comment: ### Hardcoded admin user may not exist <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? The hardcoded default value "admin" for SUPERSET_CACHE_WARMUP_USER may cause runtime failures if no user with username "admin" exists in the system. ###### Why this matters If the system doesn't have a user named "admin", cache warmup operations will fail when trying to authenticate or impersonate this non-existent user, breaking the cache warmup functionality entirely. ###### Suggested change ∙ *Feature Preview* Either make this configurable via environment variable with a fallback, add validation to ensure the user exists, or use a more robust default mechanism: ```python # Cache warmup user SUPERSET_CACHE_WARMUP_USER = os.environ.get("SUPERSET_CACHE_WARMUP_USER", "admin") ``` Or add a comment explaining the requirement: ```python # Cache warmup user - ensure this user exists in your system # or override in superset_config.py SUPERSET_CACHE_WARMUP_USER = "admin" ``` ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b08b6d89-00fc-493f-ac66-c0bbed23628e/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b08b6d89-00fc-493f-ac66-c0bbed23628e?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b08b6d89-00fc-493f-ac66-c0bbed23628e?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b08b6d89-00fc-493f-ac66-c0bbed23628e?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/b08b6d89-00fc-493f-ac66-c0bbed23628e) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:b34b020f-f527-483b-9763-dc3a13c570f8 --> [](b34b020f-f527-483b-9763-dc3a13c570f8) ########## superset/utils/webdriver.py: ########## @@ -243,7 +246,30 @@ def get_screenshot( # pylint: disable=too-many-locals, too-many-statements # n class WebDriverSelenium(WebDriverProxy): - def create(self) -> WebDriver: + def __init__( + self, + driver_type: str, + window: WindowSize | None = None, + user: User | None = None, + ): + super().__init__(driver_type, window) + self._user = user + self._driver = None + + def __del__(self) -> None: + self._destroy() Review Comment: ### Unreliable WebDriver cleanup in destructor <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? The __del__ method in WebDriverSelenium relies on Python's garbage collection to clean up WebDriver resources, which is unreliable and may cause resource leaks. ###### Why this matters Python's __del__ method is not guaranteed to be called promptly or at all, which can lead to WebDriver processes remaining active and consuming system resources, especially in long-running applications or when exceptions occur. ###### Suggested change ∙ *Feature Preview* Implement a context manager or explicit cleanup method instead of relying on __del__: ```python def close(self) -> None: """Explicitly close the WebDriver""" self._destroy() def __enter__(self): return self def __exit__(self, exc_type, exc_val, exc_tb): self.close() ``` ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/c4c684ad-595f-48ed-a19e-77724149ae73/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/c4c684ad-595f-48ed-a19e-77724149ae73?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/c4c684ad-595f-48ed-a19e-77724149ae73?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/c4c684ad-595f-48ed-a19e-77724149ae73?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/c4c684ad-595f-48ed-a19e-77724149ae73) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:e6a10a4f-ba6d-4c78-ab80-88df9da51016 --> [](e6a10a4f-ba6d-4c78-ab80-88df9da51016) ########## superset/tasks/cache.py: ########## @@ -118,8 +109,12 @@ class DummyStrategy(Strategy): # pylint: disable=too-few-public-methods name = "dummy" - def get_tasks(self) -> list[CacheWarmupTask]: - return [get_task(chart) for chart in db.session.query(Slice).all()] + def get_urls(self) -> list[str]: + dashboards = ( + db.session.query(Dashboard).filter(Dashboard.published.is_(True)).all() + ) + + return [get_dash_url(dashboard) for dashboard in dashboards if dashboard.slices] Review Comment: ### Memory inefficient dashboard loading in DummyStrategy <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? Loading all published dashboards into memory at once using .all() can cause excessive memory usage with large datasets. ###### Why this matters This approach loads the entire result set into memory simultaneously, which can cause memory pressure and poor performance when dealing with thousands of dashboards. ###### Suggested change ∙ *Feature Preview* Use lazy evaluation or batch processing to avoid loading all dashboards at once: ```python def get_urls(self) -> list[str]: urls = [] dashboards = ( db.session.query(Dashboard) .filter(Dashboard.published.is_(True)) .yield_per(100) # Process in batches ) for dashboard in dashboards: if dashboard.slices: urls.append(get_dash_url(dashboard)) return urls ``` ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/f410df69-a8bd-41cb-8eae-c3eea83e082b/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/f410df69-a8bd-41cb-8eae-c3eea83e082b?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/f410df69-a8bd-41cb-8eae-c3eea83e082b?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/f410df69-a8bd-41cb-8eae-c3eea83e082b?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/f410df69-a8bd-41cb-8eae-c3eea83e082b) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:dca0142a-57b2-4b2f-8525-b8a32e87eb53 --> [](dca0142a-57b2-4b2f-8525-b8a32e87eb53) ########## docs/docs/configuration/cache.mdx: ########## @@ -80,6 +80,30 @@ instead requires a cachelib object. See [Async Queries via Celery](/docs/configuration/async-queries-celery) for details. +## Celery beat + +Superset has a Celery task that will periodically warm up the cache based on different strategies. +To use it, add the following to the `CELERYBEAT_SCHEDULE` section in `config.py`: + +```python +SUPERSET_CACHE_WARMUP_USER = "user_with_permission_to_dashboards" + +CELERYBEAT_SCHEDULE = { + 'cache-warmup-hourly': { + 'task': 'cache-warmup', + 'schedule': crontab(minute=0, hour='*'), # hourly + 'kwargs': { + 'strategy_name': 'top_n_dashboards', + 'top_n': 5, + 'since': '7 days ago', + }, + }, +} +``` Review Comment: ### Missing crontab import in configuration example <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? The configuration example is missing the required import statement for the `crontab` function, which will cause a NameError when users try to use this configuration. ###### Why this matters Users copying this configuration will encounter a runtime error when Celery tries to parse the schedule, preventing the cache warmup task from being scheduled properly. ###### Suggested change ∙ *Feature Preview* Add the required import statement at the beginning of the code block: ```python from celery.schedules import crontab SUPERSET_CACHE_WARMUP_USER = "user_with_permission_to_dashboards" CELERYBEAT_SCHEDULE = { 'cache-warmup-hourly': { 'task': 'cache-warmup', 'schedule': crontab(minute=0, hour='*'), # hourly 'kwargs': { 'strategy_name': 'top_n_dashboards', 'top_n': 5, 'since': '7 days ago', }, }, } ``` ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/3232ad17-431a-44fd-987d-5221b9f78f23/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/3232ad17-431a-44fd-987d-5221b9f78f23?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/3232ad17-431a-44fd-987d-5221b9f78f23?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/3232ad17-431a-44fd-987d-5221b9f78f23?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/3232ad17-431a-44fd-987d-5221b9f78f23) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:031b42a2-8ad8-4bf5-a97c-2f8b30373fba --> [](031b42a2-8ad8-4bf5-a97c-2f8b30373fba) ########## superset/utils/webdriver.py: ########## @@ -381,10 +413,14 @@ def find_unexpected_errors(driver: WebDriver) -> list[str]: return error_messages - def get_screenshot(self, url: str, element_name: str, user: User) -> bytes | None: # noqa: C901 - driver = self.auth(user) - driver.set_window_size(*self._window) - driver.get(url) + def get_screenshot( # noqa: C901 + self, url: str, element_name: str, user: User | None = None + ) -> bytes | None: + if user and not self._user: + self._user = user + if self._driver: + self._auth(user) Review Comment: ### User switching not handled in WebDriverSelenium <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? The user authentication logic in WebDriverSelenium.get_screenshot() only authenticates when there's no existing user, but doesn't handle the case where a different user is passed in subsequent calls. ###### Why this matters If the WebDriverSelenium instance is reused with different users, the second user won't be authenticated because self._user is already set from the first call, potentially causing authorization failures or screenshots being taken with the wrong user context. ###### Suggested change ∙ *Feature Preview* Change the condition to authenticate whenever a different user is provided: ```python if user and user != self._user: self._user = user if self._driver: self._auth(user) ``` ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/ac939f7c-092f-4df0-a5e0-d8b43bbe21f9/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/ac939f7c-092f-4df0-a5e0-d8b43bbe21f9?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/ac939f7c-092f-4df0-a5e0-d8b43bbe21f9?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/ac939f7c-092f-4df0-a5e0-d8b43bbe21f9?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/ac939f7c-092f-4df0-a5e0-d8b43bbe21f9) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:2ea95922-199c-463e-8de6-7c7ce413da7f --> [](2ea95922-199c-463e-8de6-7c7ce413da7f) ########## superset/tasks/cache.py: ########## @@ -307,25 +239,20 @@ def cache_warmup( logger.exception(message) return message - results: dict[str, list[str]] = {"scheduled": [], "errors": []} - for task in strategy.get_tasks(): - username = task["username"] - payload = json.dumps(task["payload"]) - if username: - try: - user = security_manager.get_user_by_username(username) - cookies = MachineAuthProvider.get_auth_cookies(user) - headers = { - "Cookie": f"session={cookies.get('session', '')}", - "Content-Type": "application/json", - } - logger.info("Scheduling %s", payload) - fetch_url.delay(payload, headers) - results["scheduled"].append(payload) - except SchedulingError: - logger.exception("Error scheduling fetch_url for payload: %s", payload) - results["errors"].append(payload) - else: - logger.warn("Executor not found for %s", payload) + results: dict[str, list[str]] = {"success": [], "errors": []} + + user = security_manager.find_user( + username=current_app.config["SUPERSET_CACHE_WARMUP_USER"] + ) + wd = WebDriverSelenium(current_app.config["WEBDRIVER_TYPE"], user=user) + + for url in strategy.get_urls(): + try: + logger.info("Fetching %s", url) + wd.get_screenshot(url, "grid-container") + results["success"].append(url) + except URLError: + logger.exception("Error warming up cache!") + results["errors"].append(url) Review Comment: ### WebDriver resource leak in cache warmup loop <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? WebDriver instance is created once and reused for all URLs without proper cleanup, potentially causing resource leaks and browser session accumulation. ###### Why this matters This can lead to memory leaks, zombie browser processes, and degraded performance over time as browser instances accumulate without being properly closed. ###### Suggested change ∙ *Feature Preview* Use a context manager or ensure proper cleanup of the WebDriver instance after processing all URLs: ```python with WebDriverSelenium(current_app.config["WEBDRIVER_TYPE"], user=user) as wd: for url in strategy.get_urls(): try: logger.info("Fetching %s", url) wd.get_screenshot(url, "grid-container") results["success"].append(url) except URLError: logger.exception("Error warming up cache!") results["errors"].append(url) ``` Or add explicit cleanup: ```python wd = WebDriverSelenium(current_app.config["WEBDRIVER_TYPE"], user=user) try: for url in strategy.get_urls(): # ... existing loop code ... finally: wd.quit() # or appropriate cleanup method ``` ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/88bc6dd4-46c3-4630-8019-87a13500b071/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/88bc6dd4-46c3-4630-8019-87a13500b071?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/88bc6dd4-46c3-4630-8019-87a13500b071?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/88bc6dd4-46c3-4630-8019-87a13500b071?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/88bc6dd4-46c3-4630-8019-87a13500b071) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:9b79540b-1261-40e7-8602-e81af7bde031 --> [](9b79540b-1261-40e7-8602-e81af7bde031) ########## superset/utils/webdriver.py: ########## @@ -243,7 +246,30 @@ def get_screenshot( # pylint: disable=too-many-locals, too-many-statements # n class WebDriverSelenium(WebDriverProxy): - def create(self) -> WebDriver: + def __init__( + self, + driver_type: str, + window: WindowSize | None = None, + user: User | None = None, + ): + super().__init__(driver_type, window) + self._user = user + self._driver = None + + def __del__(self) -> None: + self._destroy() + + @property + def driver(self) -> WebDriver: + if not self._driver: + self._driver = self._create() + assert self._driver # for mypy + self._driver.set_window_size(*self._window) + if self._user: + self._auth(self._user) + return self._driver Review Comment: ### Lazy WebDriver initialization causes unpredictable latency <sub></sub> <details> <summary>Tell me more</summary> ###### What is the issue? The lazy initialization of WebDriver in the property getter creates expensive operations (browser startup, authentication) on first access, potentially blocking the calling thread. ###### Why this matters This can cause unpredictable performance as the first screenshot operation will be significantly slower than subsequent ones, making it difficult to set appropriate timeouts and potentially causing user-facing delays. ###### Suggested change ∙ *Feature Preview* Move driver initialization to a dedicated method that can be called explicitly when needed, or implement async initialization: ```python def initialize_driver(self) -> None: if not self._driver: self._driver = self._create() self._driver.set_window_size(*self._window) if self._user: self._auth(self._user) ``` ###### Provide feedback to improve future suggestions [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/566e101d-8135-4a16-9e5f-513318e17740/upvote) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/566e101d-8135-4a16-9e5f-513318e17740?what_not_true=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/566e101d-8135-4a16-9e5f-513318e17740?what_out_of_scope=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/566e101d-8135-4a16-9e5f-513318e17740?what_not_in_standard=true) [](https://app.korbit.ai/feedback/aa91ff46-6083-4491-9416-b83dd1994b51/566e101d-8135-4a16-9e5f-513318e17740) </details> <sub> 💬 Looking for more details? Reply to this comment to chat with Korbit. </sub> <!--- korbi internal id:6621d13a-a1ec-4158-bdec-3d79ee226259 --> [](6621d13a-a1ec-4158-bdec-3d79ee226259) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
