aicam opened a new issue, #4324: URL: https://github.com/apache/texera/issues/4324
### Task Summary

### **Title: Refactor `dataset` to generalized `asset` concept to support new resource types**

#### **Context**
Currently in Texera, we use the keyword `dataset` to represent a group of files a user has uploaded into a LakeFS repository. Essentially, a `dataset` acts as a file system. This concept is heavily coupled throughout our stack: it functions as a resource in the dashboard, defines types and structures during workflow execution, and serves as a direct reference point within UDFs and Python code.

#### **Motivation**
We are planning to introduce a new resource type called `model`. Under the hood, a `model` is architecturally identical to a `dataset`: it is simply a repository containing files and folders in a tree structure (similar to an S3 bucket), backed by LakeFS and MinIO.

Because models and datasets share the exact same underlying `file-system` storage mechanism, the interpretation of the files stored in MinIO should be decoupled from the storage structure itself. Instead, the reading process (e.g., a UDF operator reading a file as a binary) should dictate how the content is interpreted.

#### **Proposed Solution**
To generalize our current LakeFS/MinIO storage architecture to support both `datasets` and `models`, we need to refactor the codebase to use a broader abstraction. We propose introducing a new core keyword: **`asset`**.

The `asset` concept will act as the universal pointer to our storage layer, encompassing various specific resource types, including both `dataset` and `model`.

#### **Tasks & Acceptance Criteria**
To implement this abstraction, we need to replace occurrences of `dataset` with the generalized `asset` keyword across the stack.

- [ ] **Database:** Rename all `dataset` occurrences in Postgres table names.
- [ ] **Common Utilities:** Update all storage utilities in the `common` directory that currently refer to `dataset`.
- [ ] **File Service:** Refactor all files within `file-service` to use the `asset` terminology.
- [ ] **UDF Definitions:** Update UDF classes and type definitions that currently hardcode `dataset` as the sole reference to storage.

### Priority
P2 – Medium

### Task Type
- [x] Code Implementation
- [ ] Documentation
- [x] Refactor / Cleanup
- [x] Testing / QA
- [x] DevOps / Deployment
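
#### **Illustrative Sketch**
To make the proposed `asset` abstraction concrete, here is a minimal Python sketch of the idea from the Motivation and Proposed Solution sections: one storage structure shared by every resource type, with interpretation left to the reader. All names here (`Asset`, `AssetType`, `read_text`) are hypothetical and are not taken from the Texera codebase. An in-memory dict stands in for the LakeFS/MinIO repository, so no real storage API is called.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict


class AssetType(Enum):
    """Hypothetical resource kinds that share one storage layout."""
    DATASET = "dataset"
    MODEL = "model"


@dataclass
class Asset:
    """Hypothetical generalized pointer to a file-tree repository.

    The dict below is a stand-in for a LakeFS/MinIO-backed repository;
    it only mimics the "path -> raw bytes" shape of the storage layer.
    """
    name: str
    asset_type: AssetType
    files: Dict[str, bytes] = field(default_factory=dict)

    def write(self, path: str, content: bytes) -> None:
        # Store raw bytes under a path, whatever the asset type is.
        self.files[path] = content

    def read_bytes(self, path: str) -> bytes:
        # Storage returns raw bytes; it does not interpret the content.
        return self.files[path]


def read_text(asset: Asset, path: str, encoding: str = "utf-8") -> str:
    """Reader-side interpretation: decode the stored bytes as text."""
    return asset.read_bytes(path).decode(encoding)


# A dataset and a model use the same class and the same storage calls.
dataset = Asset("sales", AssetType.DATASET)
dataset.write("data/2024.csv", b"id,amount\n1,10\n")

model = Asset("classifier", AssetType.MODEL)
model.write("weights/model.bin", b"\x00\x01\x02")

# The caller, not the storage layer, decides how to interpret each file.
csv_text = read_text(dataset, "data/2024.csv")
weights = model.read_bytes("weights/model.bin")
```

In this sketch the resource type is only a label on the asset. Whether a file is read as text or as a binary blob is chosen at the call site, which mirrors the example of a UDF operator reading a file as a binary.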
