sstriker commented on issue #2083:
URL: https://github.com/apache/buildstream/issues/2083#issuecomment-3454136984

   # Potential approach to implementing "Speculative Actions: Predictive Cache Priming for BuildStream"
   
   ## Summary
   
   Proposal for the implementation of **Speculative Actions**, a cache priming 
mechanism that enables fine-grained build parallelism in BuildStream by 
speculatively executing compiler invocations before the actual element builds 
run, thereby warming the Remote Execution ActionCache. Familiarity with the 
Speculative Actions proposal is assumed.
   
   **Artifact Overlays** - each artifact includes overlay metadata that maps 
its file digests back to their ultimate sources. This eliminates the need to 
fetch dependency sources during overlay generation, simplifying the 
implementation.
   
   **Core Design Principles:**
   1. **Queue separation of concerns**: Each queue does one job, with no 
cross-cutting responsibilities
   2. **Action-granular processing**: One job per SpeculativeAction for maximum 
parallelism
   3. **Avoid unnecessary work**: No submission of unchanged speculative 
actions
   4. **Incremental discovery**: Progressive reference discovery as artifacts 
are pulled
   5. **Natural composition**: Filter, stack, and compose elements work without 
special casing
   6. **Self-contained artifacts**: Artifact overlays encode complete source 
attribution
   
   ---
   
   ## Data Model
   
   ### Protocol Buffer Definitions
   
   **New file**: 
`src/buildstream/_protos/build/buildstream/v2/speculative_actions.proto`
   
   ```protobuf
   syntax = "proto3";
   
   package build.buildstream.v2;
   
   import "build/bazel/remote/execution/v2/remote_execution.proto";
   
   message SpeculativeActions {
     // Speculative actions for this element's build
     repeated SpeculativeAction actions = 1;
     
     // Overlays that map artifact file digests to their sources
     // Enables downstream elements to resolve dependencies without fetching
     // sources
     repeated Overlay artifact_overlays = 2;
     
     // References to other elements' SpeculativeActions
     // Used when an element's artifact is unavailable yet referenced by others
     repeated ReferencedSpeculativeActions referenced_speculative_actions = 3;
     
     message SpeculativeAction {
       // Original action digest from build
       build.bazel.remote.execution.v2.Digest base_action_digest = 1;
       
       // Overlays to apply to instantiate action
       repeated Overlay overlays = 2;
     }
     
     message Overlay {
       enum OverlayType {
         SOURCE = 0;    // From element's source tree
         ARTIFACT = 1;  // From dependency element's artifact output
         ACTION = 2;    // From another speculative action output
       }
       
       OverlayType type = 1;
       
       // Element name providing the source
       string source_element = 2;
       
       // For ACTION type: digest of the source action
       build.bazel.remote.execution.v2.Digest source_action = 3;
       
       // Path within source (source tree, artifact, or action output)
       string source_path = 4;
       
       // When instantiating the action, find all occurrences of this digest
       // in the action's input tree (across all paths) and replace with the
       // digest of the file at source_path from source_element. If multiple
       // files in the input tree share this digest, all will be replaced with
       // the same source.
       build.bazel.remote.execution.v2.Digest target_digest = 5;
     }
     
     message ReferencedSpeculativeActions {
       // Element name that this reference points to
       string element = 1;
       
       // Digest of that element's SpeculativeActions
       build.bazel.remote.execution.v2.Digest speculative_actions = 2;
     }
   }
   ```
   
   **Modified**: `src/buildstream/_protos/buildstream/v2/artifact.proto`
   
   ```protobuf
   message Artifact {
     // ... existing fields ...
     
     // Speculative actions for cache priming (field 19)
     build.bazel.remote.execution.v2.Digest speculative_actions = 19;
   }
   ```
   
   **REAPI Extension** (coordinate with buildbox-worker team):
   
   ```protobuf
   message ActionResult {
     // ... existing REAPI fields ...
     
     // Digests of spawned Actions during execution
     // For BuildStream: compiler invocations recorded by buildbox
     // infrastructure
     repeated Digest subactions = 10;
   }
   ```
   
   ### Data Model Rationale
   
   **Self-contained SpeculativeActions**:
   - Single proto file import for any code using this feature
   - Clear namespace and ownership
   
   **Artifact overlays enable composition**:
   - Each artifact carries its own source attribution
   - Downstream elements compose by looking up in dependency artifact_overlays
   - No need to fetch dependency sources during overlay generation
   
   **Three overlay types**:
   - **SOURCE**: Direct reference to element's source tree (preferred)
   - **ACTION**: Reference to another subaction's output (build-generated files)
   - **ARTIFACT**: Fallback when source attribution is unknown (dependency 
artifact file)
   
   ---
   
   ## Artifact-level Overlays
   
   ### What They Solve
   
   When generating overlays for element `app`, a subaction references a file 
with digest `ABC123`. This file exists in dependency `liba`'s artifact at 
`/usr/include/header.h`. But where did `liba` get this file? Did it come from 
`liba`'s sources, or from `liba`'s dependency `base`?
   
   Without artifact overlays, we would need to:
   1. Fetch `liba` sources and check for the file
   2. If not found, fetch `base` sources and check
   3. Continue up the dependency chain until found
   
   This requires complex FetchQueue coordination, callbacks, retry logic, and 
state management.
   
   **With artifact overlays**: When `liba` was built, it generated 
`artifact_overlays` that recorded where each artifact file came from. Looking 
up `ABC123` in `liba.artifact_overlays` immediately returns:
   
   ```protobuf
   Overlay {
     type: SOURCE
     source_element: "base"
     source_path: "src/header.h"
     target_digest: ABC123
   }
   ```
   
   Now we know the file came from `base/src/header.h`. No source fetches needed.
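
   The lookup described above amounts to an index keyed by `target_digest`. The following is a minimal sketch, not the proposed implementation: `Overlay` is a plain-Python stand-in for the proto message, digests are simplified to strings, and `index_overlays` and `lookup` are hypothetical helper names.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict, Iterable, Optional


class OverlayType(Enum):
    SOURCE = 0
    ARTIFACT = 1
    ACTION = 2


@dataclass(frozen=True)
class Overlay:
    # Stand-in for the proposed Overlay message; digests are plain strings
    # here instead of REAPI Digest messages (an assumption for brevity).
    type: OverlayType
    source_element: str
    source_path: str
    target_digest: str


def index_overlays(overlays: Iterable[Overlay]) -> Dict[str, Overlay]:
    """Index overlays by target_digest; the first entry for a digest wins."""
    index: Dict[str, Overlay] = {}
    for overlay in overlays:
        index.setdefault(overlay.target_digest, overlay)
    return index


def lookup(index: Dict[str, Overlay], digest: str) -> Optional[Overlay]:
    """Return the recorded attribution for a digest, or None if unknown."""
    return index.get(digest)


# liba's artifact_overlays, as in the example above.
liba_overlays = index_overlays([
    Overlay(OverlayType.SOURCE, "base", "src/header.h", "ABC123"),
])
hit = lookup(liba_overlays, "ABC123")
miss = lookup(liba_overlays, "UNKNOWN")
```

   A single dictionary lookup replaces the walk up the dependency chain; a miss simply means no attribution was recorded for that digest.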
   
   ### How They Work
   
   **Generated during element build**:
   
   When an element completes its build, SpeculativeActionGenerationQueue:
   1. Generates overlays for the element's subactions
   2. Generates `artifact_overlays` for the element's artifact files using the 
same digest resolution algorithm
   3. Writes both to SpeculativeActions attached to the artifact
   
   **Used by downstream elements**:
   
   When a downstream element generates overlays:
   1. Subaction references file with digest `XYZ`
   2. Find `XYZ` in dependency artifact
   3. Look up `XYZ` in dependency's `artifact_overlays`
   4. Use the overlay from `artifact_overlays` for this subaction
   
   ### Transitive Source Attribution
   
   **Example**:
   
   ```
   base:
     - Source: src/common.h (digest XYZ)
     - Artifact: /usr/include/common.h (digest XYZ)
     - Generates artifact_overlays:
         Overlay {
           type: SOURCE
           source_element: "base"
           source_path: "src/common.h"
           target_digest: XYZ
         }
   
   liba (depends on base):
     - No source for common.h
     - Artifact: /usr/include/common.h (digest XYZ) [from base]
     - Generates artifact_overlays:
         1. Find XYZ in base artifact
         2. Look up XYZ in base.artifact_overlays
         3. Copy the overlay to liba.artifact_overlays:
            Overlay {
              type: SOURCE
              source_element: "base"
              source_path: "src/common.h"
              target_digest: XYZ
            }
   
   app (depends on liba):
     - Subaction references common.h (digest XYZ)
     - Generates overlay:
         1. Find XYZ in liba artifact
         2. Look up XYZ in liba.artifact_overlays
         3. Use the overlay:
            Overlay {
              type: SOURCE
              source_element: "base"
              source_path: "src/common.h"
              target_digest: XYZ
            }
   ```
   
   Source attribution flows through artifact_overlays. Each element encodes 
"where my artifact files came from" once, and all downstream elements benefit.
   
   ### Natural Handling of Composition Elements
   
   **Filter elements**:
   
   ```
   filter-element:
     depends: base
     config:
       include: ['/usr/bin/*']
   ```
   
   When generating `artifact_overlays` for filter:
   - Filter artifact contains subset of base artifact files
   - For each file in filter artifact:
     - Find digest in base artifact
     - Look up in base.artifact_overlays
     - Copy overlay to filter.artifact_overlays
   - Result: filter.artifact_overlays contains only included files
   
   Downstream elements using filter get the same source attribution as if they 
used base directly.
   
   **Stack elements**:
   
   ```
   stack-element:
     depends: [liba, libb]
   ```
   
   When generating `artifact_overlays` for stack:
   - Stack artifact contains files from both liba and libb
   - For each file in stack artifact:
     - Check if from liba: look up in liba.artifact_overlays
     - Check if from libb: look up in libb.artifact_overlays
     - Copy matching overlay to stack.artifact_overlays
   - Result: stack.artifact_overlays = union of dependency overlays
   
   Downstream elements get correct source attribution for each file.
   
   **Compose elements**:
   
   ```
   compose-element:
     depends: base
     config:
       include: [integration/]
   ```
   
   When generating `artifact_overlays` for compose:
   - Compose may rewrite paths but preserves file digests
   - For each file in compose artifact:
     - Find digest in base artifact (may be different path)
     - Look up in base.artifact_overlays
     - Copy overlay to compose.artifact_overlays
   - Result: Source attribution preserved despite path changes
   
   **Why no special casing needed**: The overlay generation algorithm only 
matches digests. Composition elements create artifacts whose file digests 
appear in their dependencies. Lookup in artifact_overlays naturally traces 
through composition to ultimate sources.
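
   To illustrate why digest matching alone is enough, here is a hedged sketch of overlay propagation. The tuple-based overlay, the `propagate_overlays` name, and the example digests and paths are all assumptions made for illustration; the real data would be the proto messages defined earlier.

```python
from typing import Dict, List, Tuple

# Simplified stand-ins for the proposed messages (assumptions):
Overlay = Tuple[str, str, str]        # (type, source_element, source_path)
OverlayIndex = Dict[str, Overlay]     # target_digest -> overlay
Artifact = Dict[str, str]             # path in artifact -> file digest


def propagate_overlays(artifact: Artifact,
                       dependency_indexes: List[OverlayIndex]) -> OverlayIndex:
    """Copy the overlay of every artifact file whose digest a dependency knows."""
    result: OverlayIndex = {}
    for digest in artifact.values():
        for index in dependency_indexes:
            if digest in index:
                result[digest] = index[digest]
                break  # first dependency that attributes the digest wins
    return result


base_index = {
    "d-tool": ("SOURCE", "base", "scripts/tool.sh"),
    "d-hdr": ("SOURCE", "base", "src/common.h"),
}
libb_index = {"d-b": ("SOURCE", "libb", "src/b.h")}

# Filter: a subset of base's files, so only those overlays are copied.
filter_index = propagate_overlays({"/usr/bin/tool.sh": "d-tool"}, [base_index])

# Compose: the path changes but the digest does not, so attribution survives.
compose_index = propagate_overlays({"/opt/include/common.h": "d-hdr"},
                                   [base_index])

# Stack: files from several dependencies give the union of their overlays.
stack_index = propagate_overlays(
    {"/usr/include/common.h": "d-hdr", "/usr/include/b.h": "d-b"},
    [base_index, libb_index],
)
```

   The function never inspects artifact paths or element kinds, which is the property that lets filter, stack, and compose elements share one code path.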
   
   ---
   
   ## Scheduler Queue Flow
   
   ### Queue Sequence
   
   ```
   1. TrackQueue              - Track source references
   2. SourcePushQueue         - Push sources to remote
   3. FetchQueue              - Fetch sources
   4. PullQueue               - Pull artifacts from remote
   
   5. SpeculativeCachePrimingQueue (NEW)
      ├─ Runs concurrently with BuildQueue
      ├─ Entry: Element completes PullQueue
      ├─ Creates one job per SpeculativeAction
      ├─ Jobs trigger FetchQueue for SOURCE overlays
      ├─ Jobs trigger PullQueue for ARTIFACT overlays
      ├─ Jobs wait for other jobs for ACTION overlays
      └─ Submits actions to remote execution
   
   6. BuildQueue              - Build elements
   
   7. SpeculativeActionGenerationQueue (NEW)
      ├─ Entry: Element completes BuildQueue (new build)
      ├─ Generates SpeculativeActions and artifact_overlays
      ├─ No dependency source fetching needed
      └─ Attaches to artifact
   
   8. ArtifactPushQueue       - Push artifacts
   ```
   
   ### Queue Behavior
   
   **SpeculativeCachePrimingQueue**:
   - Triggered by: PullQueue completion
   - Triggers: FetchQueue (SOURCE overlays), PullQueue (ARTIFACT overlays)
   - Waits for: Other priming jobs (ACTION overlays)
   - Skips when: Network disabled, element cached, priming disabled
   
   **SpeculativeActionGenerationQueue**:
   - Triggered by: BuildQueue completion
   - Triggers: Nothing
   - Waits for: Nothing
   - Processes immediately (all data available)
   
   ---
   
   ## Speculative Action Generation
   
   ### When It Happens
   
   SpeculativeActionGenerationQueue processes elements that complete BuildQueue 
with a new build (not a cache hit).
   
   All necessary data is already available:
   - Element's sources (fetched before build)
   - Subaction digests (from build's ActionResult)
   - Element's artifact (just created)
   - Dependency artifacts (pulled before build)
   - Dependency artifact_overlays (in artifacts)
   
   No waiting, no triggers, no callbacks.
   
   ### Algorithm
   
   **For each subaction**:
   
   ```
   1. Fetch base Action from CAS
   2. Traverse Action's input_root_digest to extract all file digests
   3. For each digest, resolve to an Overlay:
   
      a. Check element's own sources
         If found: return SOURCE overlay for current element
      
      b. Check element's other subaction outputs
         If found: return ACTION overlay for current element
      
      c. Check each dependency artifact:
         - Find digest in dependency artifact
         - If found:
           - Look up digest in dependency.artifact_overlays
           - If found: return that overlay
           - If not found: return ARTIFACT overlay (fallback)
      
      d. If not found anywhere: skip this digest
   
   4. Create SpeculativeAction with base_action_digest and collected overlays
   ```
   
   **For artifact_overlays**:
   
   ```
   For each file in element's artifact:
     1. Get file's digest
     2. Apply same resolution algorithm as above
     3. If overlay found: add to artifact_overlays
   ```
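
   The priority order in steps 3a-3d can be sketched as a single function. This is an illustration under stated assumptions: digests are strings, an overlay is reduced to a `(type, source_element, source_path)` tuple (so the `source_action` digest of ACTION overlays is omitted), and `resolve_digest` with its parameters is a hypothetical name, not an existing BuildStream API.

```python
from typing import Dict, Optional, Tuple

Overlay = Tuple[str, str, str]  # (type, source_element, source_path), simplified


def resolve_digest(digest: str,
                   element: str,
                   own_sources: Dict[str, str],
                   own_action_outputs: Dict[str, str],
                   dep_artifacts: Dict[str, Dict[str, str]],
                   dep_overlays: Dict[str, Dict[str, Overlay]],
                   ) -> Optional[Overlay]:
    """Resolve one digest to an overlay, or None if it should be skipped.

    own_sources / own_action_outputs map digest -> path.
    dep_artifacts maps element -> (digest -> path in that artifact).
    dep_overlays maps element -> (digest -> recorded overlay).
    """
    if digest in own_sources:                        # 3a: own source tree
        return ("SOURCE", element, own_sources[digest])
    if digest in own_action_outputs:                 # 3b: other subaction output
        return ("ACTION", element, own_action_outputs[digest])
    for dep, files in dep_artifacts.items():         # 3c: dependency artifacts
        if digest in files:
            known = dep_overlays.get(dep, {}).get(digest)
            if known is not None:
                return known                         # reuse recorded attribution
            return ("ARTIFACT", dep, files[digest])  # fallback
    return None                                      # 3d: skip this digest


# Hypothetical data for an element "app" that depends on "liba".
sources = {"d-app": "src/app.c"}
artifacts = {"liba": {"d-so": "/usr/lib/liba.so",
                      "ABC123": "/usr/include/header.h"}}
overlays = {"liba": {"ABC123": ("SOURCE", "base", "src/header.h")}}

own = resolve_digest("d-app", "app", sources, {}, artifacts, overlays)
inherited = resolve_digest("ABC123", "app", sources, {}, artifacts, overlays)
fallback = resolve_digest("d-so", "app", sources, {}, artifacts, overlays)
skipped = resolve_digest("d-unknown", "app", sources, {}, artifacts, overlays)
```

   Because `artifact_overlays` is produced by applying the same resolution to each artifact file, the same function could serve both passes.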
   
   ### Example
   
   **Scenario**:
   
   ```
   base:
     - Source: src/util.h (digest ABC)
     - Artifact: /usr/include/util.h (digest ABC)
   
   liba (depends on base):
     - Source: src/liba.c (digest DEF)
     - Subaction: compile liba.c
       - References: liba.c (DEF), util.h (ABC)
     - Artifact: /usr/lib/liba.so (digest GHI)
   
   app (depends on liba):
     - Source: src/app.c (digest JKL)
     - Subaction: compile app.c
       - References: app.c (JKL), liba.so (GHI), util.h (ABC)
   ```
   
   **base generates**:
   
   ```
   artifact_overlays:
     Overlay { type: SOURCE, source_element: base,
               source_path: src/util.h, target_digest: ABC }
   ```
   
   **liba generates**:
   
   ```
   For subaction overlays:
     DEF → SOURCE overlay (liba, src/liba.c)
     ABC → Look up in base.artifact_overlays → SOURCE overlay (base, src/util.h)
   
   artifact_overlays:
     GHI → ACTION overlay (liba, subaction, liba.so)
     ABC → Look up in base.artifact_overlays → SOURCE overlay (base, src/util.h)
   ```
   
   **app generates**:
   
   ```
   For subaction overlays:
     JKL → SOURCE overlay (app, src/app.c)
     GHI → Look up in liba.artifact_overlays → ACTION overlay (liba, subaction, liba.so)
     ABC → Look up in liba.artifact_overlays → SOURCE overlay (base, src/util.h)
   ```
   
   No dependency source fetches. All attribution flows through 
artifact_overlays.
   
   ---
   
   ## Cache Priming
   
   ### SpeculativeCachePrimingQueue
   
   **Entry**: Element completes PullQueue with SpeculativeActions available.
   
   **Processing**:
   
   ```
   For each SpeculativeAction in element's SpeculativeActions:
     Create one SpeculativeCachePrimingJob
   ```
   
   One job per action (not per element) enables:
   - Maximum parallelism across thousands of actions
   - Fine-grained dependency tracking
   - Early completion for resolved actions
   - Natural load balancing
   
   ### SpeculativeCachePrimingJob
   
   **Purpose**: Instantiate one SpeculativeAction and submit to remote 
execution.
   
   **Process**:
   
   ```
   1. Fetch base Action from CAS
   2. Clone Action
   3. For each overlay:
        Apply overlay (with dependency triggering)
   4. If all overlays resolved or skipped:
        If Action is different from base_action:
          Store modified Action in CAS
          Submit to remote execution
   5. If overlays pending:
        Wait, retry when dependencies available
   ```
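
   The job process above, including the rule that unchanged actions are never submitted, might look roughly like the sketch below. The dict-based `Action`, the resolver callables, and `run_priming_job` are hypothetical simplifications; storing the action in CAS is folded into the `submit` callback.

```python
import copy
from typing import Callable, Dict, List, Optional, Tuple

# Simplified stand-ins (assumptions): an Action is a dict holding an "inputs"
# map of path -> digest, and an overlay is (target_digest, resolver) where the
# resolver returns the replacement digest, or None while still pending.
Action = Dict[str, Dict[str, str]]
Overlay = Tuple[str, Callable[[], Optional[str]]]


def run_priming_job(base_action: Action,
                    overlays: List[Overlay],
                    submit: Callable[[Action], None]) -> str:
    action = copy.deepcopy(base_action)              # step 2: clone
    for target_digest, resolve in overlays:          # step 3: apply overlays
        replacement = resolve()
        if replacement is None:
            return "PENDING"                         # step 5: retry later
        # Match against the base action so replacements never chain.
        for path, digest in base_action["inputs"].items():
            if digest == target_digest:
                action["inputs"][path] = replacement
    if action == base_action:
        return "UNCHANGED"                           # nothing new to prime
    submit(action)                                   # step 4: store and execute
    return "SUBMITTED"


submitted: List[Action] = []
base = {"inputs": {"src/app.c": "old-app", "include/util.h": "hdr"}}

changed = run_priming_job(base, [("old-app", lambda: "new-app")],
                          submitted.append)
same = run_priming_job(base, [("hdr", lambda: "hdr")], submitted.append)
waiting = run_priming_job(base, [("hdr", lambda: None)], submitted.append)
```

   Only the first call reaches `submit`; the second resolves to an identical action and the third is left pending for a later retry.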
   
   ### Overlay Application
   
   **SOURCE overlay**:
   
   ```
   1. Get source element
   2. If sources cached:
        Find digest in source tree at source_path
        Replace target_digest with found digest
        Return RESOLVED
   3. If sources not cached:
        Trigger FetchQueue for source element
        Register callback for retry
        Return PENDING
   ```
   
   **ARTIFACT overlay**:
   
   ```
   1. Get dependency element
   2. If artifact available:
        Find digest in artifact at source_path
        Replace target_digest with found digest
        Return RESOLVED
   3. If artifact not available:
        Trigger PullQueue for dependency
        Register callback for retry
        Return PENDING
   ```
   
   **ACTION overlay**:
   
   ```
   1. Check if source action complete (in action_results_cache)
   2. If complete:
        Find digest in action outputs at source_path
        Replace target_digest with found digest
        Return RESOLVED
   3. If not complete:
        Register callback for retry
        Return PENDING
   ```
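
   As an illustration of the SOURCE case, the sketch below replaces every occurrence of `target_digest` in an input tree and reports RESOLVED or PENDING. The nested-dict tree is a stand-in for a REAPI `Directory`, and `replace_digest`, `apply_source_overlay`, and the `trigger_fetch` hook are hypothetical names. How a path missing from the new source tree should be handled is not specified above, so the sketch assumes the digest is left untouched.

```python
from enum import Enum
from typing import Callable, Dict


class Status(Enum):
    RESOLVED = "resolved"
    PENDING = "pending"


def replace_digest(tree: dict, target: str, replacement: str) -> int:
    """Replace every file digest equal to target in a nested input tree.

    Values are either a digest string (file) or a dict (subdirectory).
    Returns the number of replacements made.
    """
    count = 0
    for name, node in list(tree.items()):
        if isinstance(node, dict):
            count += replace_digest(node, target, replacement)
        elif node == target:
            tree[name] = replacement
            count += 1
    return count


def apply_source_overlay(tree: dict,
                         overlay: Dict[str, str],
                         cached_sources: Dict[str, Dict[str, str]],
                         trigger_fetch: Callable[[str], None]) -> Status:
    """cached_sources maps element -> (source path -> current digest)."""
    sources = cached_sources.get(overlay["source_element"])
    if sources is None:
        trigger_fetch(overlay["source_element"])  # hypothetical scheduler hook
        return Status.PENDING
    replacement = sources.get(overlay["source_path"])
    if replacement is not None:  # assumption: a missing path changes nothing
        replace_digest(tree, overlay["target_digest"], replacement)
    return Status.RESOLVED


tree = {"src": {"app.c": "old"}, "include": {"copy.c": "old", "util.h": "hdr"}}
overlay = {"source_element": "app", "source_path": "src/app.c",
           "target_digest": "old"}
requested = []

first = apply_source_overlay(tree, overlay, {}, requested.append)
second = apply_source_overlay(tree, overlay, {"app": {"src/app.c": "new"}},
                              requested.append)
```

   Both files that shared the old digest end up with the new one, matching the `target_digest` semantics in the proto comment. The ARTIFACT and ACTION cases would follow the same shape with a different lookup and trigger.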
   
   ### Callbacks and Retry
   
   When a job has pending overlays:
   1. Register callbacks with scheduler
   2. Scheduler tracks: queue/element → callbacks
   3. When dependency completes, scheduler invokes callbacks
   4. Job retries overlay application
   
   ### Shared State
   
   All priming jobs share:
   - `action_map`: base_action_digest → instantiated_action_digest
   - `action_results_cache`: instantiated_action_digest → ActionResult
   
   When a job completes:
   1. Add to action_map
   2. Submit to remote execution
   3. Asynchronously fetch ActionResult
   4. Add to action_results_cache
   5. Notify waiting jobs
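
   A minimal sketch of this shared state follows, assuming a single-threaded scheduler loop (a real implementation would likely need locking) and using plain dicts in place of `ActionResult` messages. `SharedPrimingState` and its method names are hypothetical.

```python
from typing import Callable, Dict, List, Optional


class SharedPrimingState:
    """Sketch of the state shared by priming jobs (no locking, for brevity)."""

    def __init__(self) -> None:
        # base_action_digest -> instantiated_action_digest
        self.action_map: Dict[str, str] = {}
        # instantiated_action_digest -> ActionResult (a dict in this sketch)
        self.action_results_cache: Dict[str, dict] = {}
        self._waiters: Dict[str, List[Callable[[dict], None]]] = {}

    def record_submission(self, base_digest: str, instantiated: str) -> None:
        self.action_map[base_digest] = instantiated

    def lookup(self, base_digest: str) -> Optional[dict]:
        instantiated = self.action_map.get(base_digest)
        if instantiated is None:
            return None
        return self.action_results_cache.get(instantiated)

    def wait_for(self, base_digest: str,
                 callback: Callable[[dict], None]) -> None:
        """Call back now if the result is cached, otherwise once it arrives."""
        result = self.lookup(base_digest)
        if result is not None:
            callback(result)
        else:
            self._waiters.setdefault(base_digest, []).append(callback)

    def record_result(self, base_digest: str, result: dict) -> None:
        """Cache a fetched ActionResult and notify the waiting jobs."""
        self.action_results_cache[self.action_map[base_digest]] = result
        for callback in self._waiters.pop(base_digest, []):
            callback(result)


state = SharedPrimingState()
seen: List[dict] = []
state.wait_for("base-1", seen.append)             # an ACTION overlay waits
state.record_submission("base-1", "inst-1")
state.record_result("base-1", {"liba.so": "GHI"})
state.wait_for("base-1", seen.append)             # served from the cache
```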
   
   ### Incremental Reference Discovery
   
   **Problem**: Element X depends on Y, but Y's artifact is not available 
locally. X cannot prime without Y's SpeculativeActions.
   
   **Solution**: ReferencedSpeculativeActions in artifact metadata.
   
   When generating SpeculativeActions:
   1. Identify elements appearing in overlays
   2. Add ReferencedSpeculativeActions for each
   
   When pulling artifact:
   1. Scan referenced_speculative_actions
   2. Add to reference index
   3. Notify waiting elements
   
   When priming without artifact:
   1. Check reference index
   2. Use referenced SpeculativeActions if found
   3. Otherwise mark as waiting
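
   The three steps above suggest a small dict-based index. The sketch below is one possible reading, with digests as strings; `ReferenceIndex` is named in the component outline, but the method names here are assumptions.

```python
from typing import Callable, Dict, List, Optional


class ReferenceIndex:
    """Dict-based sketch: element name -> digest of its SpeculativeActions."""

    def __init__(self) -> None:
        self._refs: Dict[str, str] = {}
        self._waiting: Dict[str, List[Callable[[str], None]]] = {}

    def add_references(self, referenced: Dict[str, str]) -> None:
        """Record the referenced_speculative_actions of a pulled artifact."""
        for element, digest in referenced.items():
            self._refs[element] = digest
            for callback in self._waiting.pop(element, []):
                callback(digest)  # notify elements waiting on this reference

    def get_or_wait(self, element: str,
                    callback: Callable[[str], None]) -> Optional[str]:
        """Return the known digest, or register the callback and return None."""
        digest = self._refs.get(element)
        if digest is None:
            self._waiting.setdefault(element, []).append(callback)
        return digest


index = ReferenceIndex()
notified: List[str] = []

missing = index.get_or_wait("liby", notified.append)  # Y not known yet
index.add_references({"liby": "sa-digest-y"})         # some artifact is pulled
found = index.get_or_wait("liby", notified.append)    # now resolvable
```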
   
   ---
   
   ## Component Outline
   
   ### New Components
   
   **Protocol Buffers**:
   - `src/buildstream/_protos/build/buildstream/v2/speculative_actions.proto`
   
   **Speculative Action Generation**:
   - `src/buildstream/_speculative_actions/generator.py`
     - SpeculativeActionGenerator class
     - Main: generate_speculative_actions(element, subaction_digests)
     - Digest resolution with priority
     - Artifact overlay generation
     - Directory traversal for digest extraction
   
   **Speculative Action Instantiation**:
   - `src/buildstream/_speculative_actions/instantiator.py`
     - SpeculativeActionInstantiator class
     - Main: instantiate_action_with_triggers(spec_action, callbacks)
     - Apply overlays with dependency triggering
     - Shared state: action_map, action_results_cache
   
   **Queues**:
   - `src/buildstream/_scheduler/queues/speculativeactiongenerationqueue.py`
     - SpeculativeActionGenerationQueue
     - SpeculativeActionGenerationJob
     
   - `src/buildstream/_scheduler/queues/speculativecacheprimingqueue.py`
     - SpeculativeCachePrimingQueue
     - SpeculativeCachePrimingJob
     - SpeculativeExecutionTracker (for cancellation)
   
   **Reference Discovery**:
   - `src/buildstream/_speculative_actions/reference_index.py`
     - ReferenceIndex class
     - Track referenced elements
     - Simple dict-based implementation
   
   ### Modified Components
   
   **Element** (`src/buildstream/element.py`):
   - Store subaction digests from ActionResult
   - Cache priming configuration support
   
   **Scheduler** (`src/buildstream/_scheduler/scheduler.py`):
   - Add new queues to queue list
   - Callback registration and notification
   - ACTION overlay callback infrastructure
   
   **ArtifactCache** (`src/buildstream/_artifactcache/artifactcache.py`):
   - `update_speculative_actions(element, spec_actions)`
   - `get_speculative_actions(element)`
   - Prefetch SpeculativeActions for weak cache keys
   
   ---
   
   ## Configuration
   
   ### Project Configuration
   
   **File**: `project.conf`
   
   ```yaml
   cache-priming:
     enabled: true
   ```
   

