kunwp1 commented on code in PR #5280:
URL: https://github.com/apache/texera/pull/5280#discussion_r3369916014


##########
common/workflow-core/src/main/scala/org/apache/texera/service/util/LargeBinaryManager.scala:
##########
@@ -20,44 +20,88 @@
 package org.apache.texera.service.util
 
 import com.typesafe.scalalogging.LazyLogging
+import org.apache.texera.amber.config.StorageConfig
 
 import java.util.UUID
 
-/**
-  * Manages the lifecycle of LargeBinaries stored in S3.
-  *
-  * Handles creation and deletion of large objects that exceed
-  * normal tuple size limits.
-  */
+/** Manages the lifecycle of LargeBinaries (objects too large for normal 
tuples) in S3. */
 object LargeBinaryManager extends LazyLogging {
-  val DEFAULT_BUCKET: String = "texera-large-binaries"
+  // From config so the JVM/Python workers and cleanup all share one bucket.
+  val DEFAULT_BUCKET: String = StorageConfig.s3LargeBinariesBucket
+
+  /** Per-execution key prefix; the single source of truth for the write and 
delete paths. */
+  private def executionPrefix(executionId: Long): String = 
s"objects/$executionId"
 
   /**
-    * Creates a new LargeBinary reference.
-    * The actual data upload happens separately via LargeBinaryOutputStream.
+    * Base URI (trailing slash) under which `executionId`'s large binaries 
live; create()
+    * appends a unique suffix. Empty when the bucket is unconfigured, so 
create() fails loudly.
     *
-    * @return S3 URI string for the new LargeBinary (format: s3://bucket/key)
+    * `executionId` must be a persisted EID. The sentinel id 
(DEFAULT_EXECUTION_ID = 1) shares

Review Comment:
   Confirmed this never happens.
   1. Large binaries are only created by workers calling `create()`
   2. Workers are only started when a workflow actually executes
   3. Before any workers starts, `initExecutionService` inserts a row into 
`WORKFLOW_EXECUTIONS`, gets back a real EID, and overwrites the context's 
default 1 with it.
   4. The code that passes workers their `objects/{eid}/` location runs after 
that. 
   
   So by the time any large binary exists, the eid is always a real EID.
   
   I didn't add a guard because I think the real execution ID can also be 1.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to