villebro opened a new issue, #29515:
URL: https://github.com/apache/superset/issues/29515

   ## [SIP] Proposal for Simplified Global Async Queries
   
   ### Motivation
   
   With [\[SIP-39\] Global Async Query Support](#9190) (GAQ for short) still hidden behind an experimental feature flag, and not actively maintained, I've been thinking about ways we could simplify the architecture and finally make this generally available in a forthcoming Superset release. I feel the following issues have all contributed to the feature not gaining wide community traction:
   - **No deduplication support**: In my experience, the lack of deduplication 
of heavy queries is one of the main bottlenecks of Superset, and tends to cause 
major issues when many people try to access the same charts/dashboards. There 
was an effort to add [deduplication support to GAQ](#14112), but the proposal 
went stale.
   - **Design complexity**: There was heated discussion during the SIP process about websockets potentially being too heavy-handed for this particular task, with channels, reconnection functionality, etc. In retrospect, I feel most of the concerns raised turned out to be valid - while the implementation may be more elegant than a simple polling solution, very few committers or community members ended up fully understanding the minutiae of the implementation, causing it to fall out of active maintenance.
   - **New components**: Using the websocket feature required adding a new 
component to the Superset deployment, along with tightly coupling with the 
Redis Streams protocol. A simpler polling solution would likely have received 
more adoption from the community, leading to the feature stabilizing more 
quickly.
    
   Having said all this, the feature is still as relevant today as it was when the original SIP was opened, and I think stabilizing it is very important, because Superset's current synchronous query execution model causes lots of issues:
   - If many people open the same chart/dashboard at the same time, they will all start a query against the underlying database, as queries are not locked or deduplicated.
   - If a user refreshes a dashboard multiple times, they can quickly congest the downstream database with heavy queries, both eating up webserver threads and database resources.
   - There's no way to cancel queries that get orphaned by closed browsers.
   - In some cases, the web worker threads/processes get blocked waiting for long-running queries to complete, making it impossible to effectively scale web worker replica sets based on CPU consumption. By moving queries to async workers, it should become possible to get by with much slimmer web worker replica sets. Furthermore, async workers could be scaled up/down based on queue depth.
   
   ### Proposed Change
   
   To simplify the architecture and reuse existing functionality, I propose the 
following:
   - The websocket architecture is removed, as it adds a lot of complexity to 
the architecture - in the future only polling would be supported.
   - The concept of a "query context cache key" is removed in favor of only a 
single cache key, i.e. the one we already use for chart data.
   - A new model is introduced for async queries. If data for a particular 
cache key isn't found, an entry is added to the new model, which tracks the 
query progress. The model will get a menu in the UI, from which users will be 
able to cancel their own queries (Admins will see all queries). Ultimately, 
this entry gets deleted once the data request is completed.
   - When requesting chart data, if the data exists in the cache, it is returned as before. When it isn't available in the cache, only the `cache_key` is returned, along with additional details: when the most recent chart data request was submitted, its status (pending, executing), the last heartbeat from the async worker, etc.
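To make the two response shapes concrete, here's a minimal sketch of what the chart data endpoint might return under this proposal; all field names and values are illustrative, not a settled interface:

```python
# Hypothetical response shapes for the chart data endpoint under this
# proposal; the field names are illustrative placeholders, not final.
def chart_data_response(cache_hit: bool) -> dict:
    """Sketch of the two possible responses to a chart data request."""
    if cache_hit:
        # Cache hit: return the chart data as today.
        return {
            "result": [
                {"data": [{"count": 42}], "cached_dttm": "2024-07-09T12:00:00"}
            ]
        }
    # Cache miss: return only the cache key plus query metadata.
    return {
        "cache_key": "1bd2ab8a1ae8",             # the single chart data cache key
        "status": "pending",                     # pending | executing
        "submitted_dttm": "2024-07-09T12:00:00", # when the request was submitted
        "last_heartbeat_dttm": None,             # no heartbeat from the worker yet
    }
```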
   
   The async execution flow is changed to be similar to SQL Lab async 
execution, with the following changes:
   - When the async worker starts executing the query, the cache key is locked using the `KeyValueDistributedLock` context manager. This ensures that only a single worker executes the query for any given cache key at a time.
   - To support automatic cancellation of queries, we add a new optional field `poll_ttl` to the query context, which makes it possible to automatically cancel queries that are not being actively polled. Every time the cache key is polled, the latest poll time is updated on the metadata object. While executing, the worker periodically checks the metadata object, and if `poll_ttl` is defined and the time since the last poll exceeds the TTL, the query is cancelled. This ensures that if a person closes a dashboard with lots of long-running queries, the queries are automatically cancelled if nobody is actively waiting for the results. By default, frontend requests have `poll_ttl` set to whichever value is set in the config (`DEFAULT_CHART_DATA_POLL_TTL`). Cache warmup requests would likely not have a `poll_ttl` set, so that they aren't cancelled simply because nobody is polling for the results.
   - To limit hammering the polling endpoint, we introduce a customizable backoff function in `superset_config.py`, which makes it possible to define how polling backoff should be implemented. The default behavior would be some sort of exponential backoff, where freshly started queries are polled more actively, and queries that have been pending/running for a long time are polled less frequently. When the frontend requests chart data, the backend provides the recommended wait time in the response based on the backoff function. Note that backoff will be based on time elapsed since query submission; this means that if I open a dashboard with a chart whose query has already been running for 10 minutes, the browser will re-poll much more slowly than it would if the query had just been dispatched to the async workers.
    
   Some random thoughts:
   - Currently, multi-query query contexts get executed serially. With this new approach, the queries can be executed in parallel, as each query is dispatched separately.
   - I feel synchronous execution is very problematic in the context of 
Superset due to the problems described in the intro of this post. Originally I 
thought about proposing making async queries the *only* supported query 
mechanism in Superset. However, as @betodealmeida [pointed 
out](https://github.com/apache/superset/issues/9190#issuecomment-2206778896), 
certain databases that are expected to return data at sub second latencies are 
better suited to a synchronous flow, as dispatching Celery tasks can add a few 
seconds of extra overhead to the process. For this reason, we should probably 
keep async execution an optional feature.  
   
   ### New or Changed Public Interfaces
   
   - New config flag for providing a backoff function for chart data polling. This will be used to tell the frontend when it should next poll for completed chart data.
   - New config flag for setting the default chart data poll TTL.
   - New FAB model + DAO for async queries, along with a new menu for managing 
currently queued/executing queries.
   - New optional fields added to chart data request: 
      - `poll_ttl`: if set, the query will be cancelled unless a client has 
asked for data within the TTL bounds. This will ensure that dashboards that are 
closed don't leave orphaned chart data requests.
      - `execution_mode`: the client can ask the query to be executed sync, 
async or using the default mode. This is specifically added for programmatic 
integrations, where implementing the polling mechanism may sometimes add 
unnecessary complexity.
   - New fields added to the chart data response:
      - `poll_delay`: how many seconds should the client wait before checking 
if the query has completed. This value will be calculated by the backend based 
on the backoff function.
      - `status` and `start_dttm`: whether the query is queued or started, and when it started executing. This information can be used by the chart component to give the user information about the state of the query, similar to how we currently display how stale the cached data is. So in the future, we may display the following: "Query is queued", or "Query executing for 2 minutes".
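Putting the new request and response fields together, a programmatic integration (the case `execution_mode` is aimed at) might look roughly like this; `post_chart_data` and `poll_cache_key` stand in for the actual HTTP calls and are hypothetical names:

```python
import time


def fetch_chart_data(post_chart_data, poll_cache_key) -> dict:
    """Sketch of a client-side polling loop using the proposed fields."""
    response = post_chart_data({
        "poll_ttl": 300,            # cancel the query if nobody polls for 5 min
        "execution_mode": "async",  # sync | async | default
    })
    # A cache hit returns the data directly; otherwise poll until complete,
    # waiting as long as the backend's backoff function recommends.
    while "result" not in response:
        time.sleep(response["poll_delay"])
        response = poll_cache_key(response["cache_key"])
    return response
```

A client that doesn't want to implement this loop could instead pass `execution_mode: "sync"` and block on the initial request.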
   
   ### New dependencies
   
   In the base proposal, I suggest not adding any new dependencies, and simply supporting polling. However, we may consider using [Server-sent events](https://developer.mozilla.org/en-US/docs/Web/API/Server-sent_events), as noted by @betodealmeida.
   
   ### Migration Plan and Compatibility
   
   1. Remove the existing feature flag and related code, along with the websocket worker code
   2. Remove references to websocket deployment/service from Helm chart
   3. Add new feature flag `SIMPLIFIED_GLOBAL_ASYNC_QUERIES`
   
   ### Rejected Alternatives
   
   - For now, I suggest not implementing Server-sent events, to keep the implementation simple.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: notifications-unsubscr...@superset.apache.org.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

