michael-s-molina commented on PR #36368:
URL: https://github.com/apache/superset/pull/36368#issuecomment-3835015191

   @villebro I put together the analysis below with AI, focused on the database overhead problem 
raised by @mistercrunch. The AI had great recommendations. Given that no current 
task has been migrated to GTF yet, feel free to tackle these recommendations in this 
PR or in a follow-up if you want a fresh/focused context.
   
   ## Database Overhead Analysis for Global Task Framework
   
   ### Tables & Indexes
   
   - **`tasks`** table with 8 indexes (dedup_key, status, scope, task_type, 
uuid, timestamps, created_by)
   - **`task_subscribers`** table with 1 index
   
   ### Per-Task Database Operations
   
   | Task Type                     | DB Writes |
   | ----------------------------- | --------- |
   | Minimal (no progress updates) | ~5        |
   | Typical (5 progress updates)  | ~10       |
   | Heavy (frequent updates)      | 15+       |
   
   ### Abort Polling Overhead (without Redis)
   
   When Redis pub/sub is unavailable, the framework falls back to database 
polling for abort detection. The current implementation spawns **one polling 
thread per abortable task**, each executing:
   
   ```sql
   SELECT * FROM tasks WHERE uuid = ? AND status IN ('ABORTING', 'ABORTED')
   ```
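   For illustration, the per-task fallback amounts to something like the following sketch (hypothetical names; `start_abort_poller` and the `stop_event` wiring are illustrative, not the actual GTF code):

```python
import threading

def start_abort_poller(task_uuid, conn, on_abort, stop_event, interval=10.0):
    """Sketch of the current per-task fallback: one daemon thread per
    abortable task, each issuing its own SELECT every `interval` seconds."""
    def poll():
        # wait() returns True once stop_event is set, which ends the loop
        while not stop_event.wait(interval):
            row = conn.execute(
                "SELECT uuid FROM tasks WHERE uuid = ? "
                "AND status IN ('ABORTING', 'ABORTED')",
                (task_uuid,),
            ).fetchone()
            if row:
                on_abort(task_uuid)
                return
    thread = threading.Thread(target=poll, daemon=True)
    thread.start()
    return thread
```

   Each registered task pays for its own thread and its own query stream, which is what the numbers below add up.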
   
   With the default 10-second polling interval (6 polls per task per minute), this results in:
   
   - **10 concurrent tasks** = 60 queries/minute
   - **100 concurrent tasks** = 600 queries/minute
   
   #### Suggested Optimization: Batched Polling
   
   A centralized coordinator could batch these into a single query:
   
   ```sql
   SELECT uuid FROM tasks WHERE uuid IN (...) AND status IN ('ABORTING', 'ABORTED')
   ```
   
   | Approach            | Queries/min (10 tasks) | Threads |
   | ------------------- | ---------------------- | ------- |
   | Current (per-task)  | 60                     | 10      |
   | Batched coordinator | 6                      | 1       |
   
   This would reduce both query volume and thread count by ~90%.
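   A minimal sketch of such a coordinator (hypothetical names; `register`/`poll_once` and the callback wiring are illustrative, not the actual GTF API):

```python
import threading

class AbortPollCoordinator:
    """Sketch of a centralized coordinator: one thread polls abort
    status for all registered tasks with a single batched query."""

    def __init__(self, conn):
        self._conn = conn
        self._lock = threading.Lock()
        self._callbacks = {}  # task uuid -> abort callback

    def register(self, task_uuid, on_abort):
        with self._lock:
            self._callbacks[task_uuid] = on_abort

    def unregister(self, task_uuid):
        with self._lock:
            self._callbacks.pop(task_uuid, None)

    def poll_once(self):
        """One batched query instead of one query per task."""
        with self._lock:
            uuids = list(self._callbacks)
        if not uuids:
            return []
        placeholders = ",".join("?" * len(uuids))
        rows = self._conn.execute(
            f"SELECT uuid FROM tasks WHERE uuid IN ({placeholders}) "
            "AND status IN ('ABORTING', 'ABORTED')",
            uuids,
        ).fetchall()
        aborted = [row[0] for row in rows]
        for task_uuid in aborted:
            with self._lock:
                callback = self._callbacks.pop(task_uuid, None)
            if callback:
                callback(task_uuid)
        return aborted
```

   A single background thread calling `poll_once()` every 10 seconds would then replace the N per-task threads.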
   
   ### Progress Update Overhead
   
   Each `ctx.update_task()` call executes 1 SELECT + 1 UPDATE. Tasks with tight 
loops can generate significant DB load:
   
   ```python
   for i in range(10000):
       process_item(i)
       ctx.update_task(progress_percent=i/100)  # 10,000 DB writes
   ```
   
   #### Suggested Optimization: Framework-Level Throttling
   
   The framework should protect itself from write storms rather than relying on 
task implementers. `TaskContext.update_task()` could:
   
   1. Track last DB write timestamp
   2. Only flush to DB if `min_update_interval` (e.g., 1 second) has elapsed
   3. Keep latest values in memory between flushes
   4. Always flush on task completion
   
   This allows task code to call `update_task()` freely without worrying about 
DB overhead.
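   A sketch of what that throttling could look like, assuming a `flush_fn` that performs the real SELECT + UPDATE (all names here are illustrative, not the actual `TaskContext` API):

```python
import time

class ThrottledTaskContext:
    """Sketch of framework-level throttling for update_task().
    `flush_fn` stands in for the real DB write (SELECT + UPDATE)."""

    def __init__(self, flush_fn, min_update_interval=1.0, clock=time.monotonic):
        self._flush_fn = flush_fn
        self._min_interval = min_update_interval
        self._clock = clock
        self._last_flush = float("-inf")  # force an immediate first flush
        self._pending = None  # latest values kept in memory between flushes

    def update_task(self, **values):
        # Merge into the in-memory snapshot; only hit the DB if enough
        # time has passed since the last write.
        self._pending = {**(self._pending or {}), **values}
        if self._clock() - self._last_flush >= self._min_interval:
            self._flush()

    def finish(self):
        # Always flush on task completion so the final state persists.
        self._flush()

    def _flush(self):
        if self._pending is not None:
            self._flush_fn(self._pending)
            self._pending = None
            self._last_flush = self._clock()
```

   A tight loop of 10,000 `update_task()` calls then collapses into at most one write per `min_update_interval` plus a final flush, regardless of how the task is written.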
   
   ### UpdateTaskCommand Overhead
   
   Each `update_task()` call currently executes:
   
   1. `validate()` - 1 SELECT to find task
   2. `_execute_update()` - 1 SELECT again (re-fetch under lock)
   3. `TaskDAO.update()` - 1 UPDATE
   
   **Total: 2 SELECTs + 1 UPDATE per call**
   
   Additionally, the `TaskContext._task` property issues a fresh SELECT on every access.
   
   #### Suggested Optimizations
   
   1. **Cache task state in TaskContext** - Avoid repeated SELECTs during 
execution; task runs single-threaded so caching is safe
   2. **Eliminate double SELECT** - `validate()` and `_execute_update()` both 
fetch the task; could reuse
   3. **Skip locking for progress-only updates** - Distributed lock on every 
update is overkill; progress updates don't conflict with submit/cancel
   4. **Write-behind for progress** - Queue updates in memory, flush 
periodically or on completion (combines with throttling)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

