aicam opened a new issue, #4324:
URL: https://github.com/apache/texera/issues/4324

   ### Task Summary
   
   ### **Title: Refactor `dataset` into a generalized `asset` concept to 
support new resource types**
   
   #### **Context**
   Currently in Texera, we use the keyword `dataset` to represent a group of 
files a user has uploaded into a LakeFS repository. Essentially, a `dataset` 
acts as a file system. This concept is tightly coupled to the rest of our 
stack: it is a resource in the dashboard, it defines types and structures 
during workflow execution, and it is referenced directly in UDFs and Python 
code.
   
   #### **Motivation**
   We are planning to introduce a new resource type called `model`. Under the 
hood, a `model` is architecturally identical to a `dataset`: it is simply a 
repository containing files and folders in a tree structure (similar to an S3 
bucket), backed by LakeFS and MinIO. 
   
   Because models and datasets share the exact same underlying `file-system` 
storage mechanism, the interpretation of the files stored in MinIO should be 
decoupled from the storage structure itself. Instead, the reader (e.g., a UDF 
operator reading a file as binary) should dictate how the content is 
interpreted. 
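
A minimal sketch of the idea that the reader, not the storage layer, decides how content is interpreted. The byte string stands in for an object fetched from storage, and the two helper functions (`read_as_binary`, `read_as_table`) are hypothetical names, not part of Texera:

```python
import csv
import io

# Stand-in for bytes fetched from storage; real code would read from LakeFS/MinIO.
RAW = b"id,label\n1,cat\n2,dog\n"


def read_as_binary(payload: bytes) -> bytes:
    """Hypothetical reader: hand the stored bytes back untouched."""
    return payload


def read_as_table(payload: bytes) -> list[dict[str, str]]:
    """Hypothetical reader: interpret the same bytes as UTF-8 CSV rows."""
    text = payload.decode("utf-8")
    return list(csv.DictReader(io.StringIO(text)))


if __name__ == "__main__":
    print(len(read_as_binary(RAW)))
    print(read_as_table(RAW))
```

Both readers consume the same stored object; nothing about the storage layout needs to know whether the file belongs to a dataset or a model.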
   
   #### **Proposed Solution**
   To generalize our current LakeFS/MinIO storage architecture to support both 
`datasets` and `models`, we need to refactor the codebase to use a broader 
abstraction. 
   
   We propose introducing a new core keyword: **`asset`**. 
   The `asset` concept will act as the universal pointer to our storage layer, 
encompassing various specific resource types, including both `dataset` and 
`model`.
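
One possible shape for the abstraction, as a sketch only. The names `AssetType`, `Asset`, `repository`, and `file_uri` are assumptions made for illustration, and the `lakefs://<repository>/<ref>/<path>` form follows the lakeFS URI convention rather than anything confirmed for Texera:

```python
from dataclasses import dataclass
from enum import Enum


class AssetType(Enum):
    """Resource types that share the same file-system style storage."""

    DATASET = "dataset"
    MODEL = "model"


@dataclass(frozen=True)
class Asset:
    """Hypothetical pointer to a storage repository, independent of resource type."""

    asset_type: AssetType
    repository: str  # assumed: name of the backing LakeFS repository

    def file_uri(self, ref: str, path: str) -> str:
        # Build a lakeFS-style URI; the path is normalized to have no leading slash.
        return f"lakefs://{self.repository}/{ref}/{path.lstrip('/')}"


if __name__ == "__main__":
    model = Asset(AssetType.MODEL, "my-repo")
    print(model.file_uri("main", "/weights/model.bin"))
```

With this layout, code that only needs file access can accept an `Asset` and ignore `asset_type`, while dashboard code can still branch on it.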
   
   #### **Tasks & Acceptance Criteria**
   To implement this abstraction, we need to replace occurrences of `dataset` 
with the generalized `asset` keyword across the stack.
   
   - [ ] **Database:** Rename every Postgres table whose name contains `dataset` to use `asset`.
   - [ ] **Common Utilities:** Update all storage utilities in the `common` 
directory that currently refer to `dataset`.
   - [ ] **File Service:** Refactor all files within `file-service` to use the 
`asset` terminology.
   - [ ] **UDF Definitions:** Update UDF classes and type definitions that 
currently hardcode `dataset` as the sole reference to storage.
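
The database task could start from generated rename statements. The sketch below assumes a hypothetical helper, `rename_statements`, and illustrative table names that are not taken from the actual Texera schema; `ALTER TABLE ... RENAME TO` is standard Postgres syntax. It does not cover indexes, constraints, sequences, or columns that also contain `dataset`.

```python
def rename_statements(table_names: list[str]) -> list[str]:
    """Return one ALTER TABLE statement per table whose name contains 'dataset'."""
    statements = []
    for name in table_names:
        new_name = name.replace("dataset", "asset")
        if new_name != name:
            # Quote identifiers so unusual table names stay valid.
            statements.append(f'ALTER TABLE "{name}" RENAME TO "{new_name}";')
    return statements


if __name__ == "__main__":
    # Illustrative table names only.
    for stmt in rename_statements(["dataset", "dataset_version", "workflow"]):
        print(stmt)
```

Generating the statements first makes it easy to review the full list of affected tables before running a migration.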
   
   ### Priority
   
   P2 – Medium
   
   ### Task Type
   
   - [x] Code Implementation
   - [ ] Documentation
   - [x] Refactor / Cleanup
   - [x] Testing / QA
   - [x] DevOps / Deployment


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

