mengw15 commented on code in PR #5085:
URL: https://github.com/apache/texera/pull/5085#discussion_r3251943189


##########
amber/src/main/python/core/storage/document_factory.py:
##########
@@ -96,19 +98,32 @@ def create_document(uri: str, schema: Schema) -> 
VirtualDocument:
             )
 
     @staticmethod
-    def open_document(uri: str) -> typing.Tuple[VirtualDocument, 
Optional[Schema]]:
+    def document_exists(uri: str) -> bool:
+        """
+        Check whether a document exists at the given URI without opening it.
+
+        Returns True iff the underlying storage already has an entry for this
+        URI (e.g., an iceberg table at the resolved namespace + storage key).
+        """

Review Comment:
   Docstring describes the happy path but not the failure modes. Callers using 
this to gate a `createDocument` call won't know what to catch. Suggest adding 
to both this method and its Scala counterpart (`DocumentFactory.scala:108-115`):
   - Scala: `@throws UnsupportedOperationException` for non-`vfs` scheme, 
`@throws IllegalArgumentException` for unsupported resource type.
   - Python: list `NotImplementedError` (non-`vfs`) and `ValueError` 
(unsupported resource type).
   
   Mirrors the implicit contract `openDocument` already has but never documents 
— this is a chance to do better.



##########
amber/src/main/python/core/storage/document_factory.py:
##########
@@ -43,6 +43,16 @@ class DocumentFactory:
 
     ICEBERG = "iceberg"
 
+    @staticmethod
+    def _resolve_namespace(resource_type: VFSResourceType) -> str:
+        match resource_type:
+            case VFSResourceType.RESULT:
+                return StorageConfig.ICEBERG_TABLE_RESULT_NAMESPACE
+            case VFSResourceType.STATE:
+                return StorageConfig.ICEBERG_TABLE_STATE_NAMESPACE
+            case _:
+                raise ValueError(f"Resource type {resource_type} is not 
supported")

Review Comment:
   Python `_resolve_namespace` covers `RESULT` + `STATE`; the Scala counterpart 
at `DocumentFactory.scala:43-52` also covers `CONSOLE_MESSAGES` + 
`RUNTIME_STATISTICS`. The existing test at `test_document_factory.py:91-92` (`# 
CONSOLE_MESSAGES has no namespace mapping in the Python factory.`) confirms 
this is intentional — but the next contributor reading the new shared helper 
will likely try to "fix" the asymmetry. Worth a one-line comment here 
documenting that the Python runtime only owns RESULT/STATE tables (the others 
are written from Scala only).



##########
common/workflow-core/src/test/scala/org/apache/texera/amber/storage/result/iceberg/IcebergDocumentSpec.scala:
##########
@@ -144,6 +144,30 @@ class IcebergDocumentSpec extends 
VirtualDocumentSpec[Tuple] with BeforeAndAfter
     }
   }
 
+  it should "report documentExists=true for a created URI and false for a 
fresh one" in {
+    assert(DocumentFactory.documentExists(uri))
+
+    val freshBase = VFSURIFactory.createPortBaseURI(
+      WorkflowIdentity(0),
+      ExecutionIdentity(0),
+      GlobalPortIdentity(
+        PhysicalOpIdentity(
+          logicalOpId = 
OperatorIdentity(s"fresh-${UUID.randomUUID().toString.replace("-", "")}"),
+          layerName = "main"
+        ),
+        PortIdentity()
+      )
+    )
+    val freshUri = VFSURIFactory.resultURI(freshBase)
+    assert(!DocumentFactory.documentExists(freshUri))
+  }

Review Comment:
   Single test asserts both directions (`true` for created URI, `false` for 
fresh URI). If the second assert fails, the test name reads as a true/false 
bundle and you have to count to figure out which scenario broke. Suggest 
splitting into two `it should`s — `"return true after createDocument"` and 
`"return false for a never-created URI"`. Optional, purely diagnostic quality.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to