AnzhiZhang commented on issue #3683: URL: https://github.com/apache/texera/issues/3683#issuecomment-3222378746
From our discussion today: we decided to allow users to create a dataset with a name already used by another user.

## Solution

In LakeFS, files are organized by repository, branch, file hierarchy, and commits. Currently, we use the dataset name as the repository name and store each dataset's files in its repository. We want to introduce a new key to be used as the repository name in LakeFS.

## How to Change and Migrate

### Database

This is our current dataset table.

<img width="2636" height="180" alt="Image" src="https://github.com/user-attachments/assets/2a2770e3-193c-4765-ae67-1ae95237895e" />

We will add a new column (e.g. `repo_name`) to record the repository name used in LakeFS.

### Code

1. Generate the repository name at dataset creation.
2. Use the repository name instead of the dataset name in LakeFS calls.
3. Update backend validation of dataset creation.
4. Update frontend validation of dataset creation.

### Challenge

**If we decide to use a UUID.** Since the underlying storage of LakeFS is S3 (currently MinIO), we assume that renaming existing repositories is impossible.

<img width="790" height="512" alt="Image" src="https://github.com/user-attachments/assets/46dd9240-54c2-497b-b24b-6291a30909eb" />

Renaming would require a migration and would greatly complicate the code, so old names cannot be changed. We will keep the old names for existing datasets and use UUIDs as the names for new ones. The new column in the database will use the string data type. UUIDs will also hurt readability when debugging in LakeFS and S3.

**If we decide to use a combination of user ID/name and dataset ID/name (e.g., `1-2-tweets-500`).** This seems to be the most feasible solution. There will still be some readability problems if we do not include a name in the format. Users can update their usernames, so it is not a good idea to include the username in the repository name. Also, once this issue is addressed, it is reasonable to allow users to edit the dataset name.
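The candidate formats discussed so far can be sketched as follows. This is only an illustration, not the Texera implementation: the function names, the `dataset-` prefix, the reading of `1-2-tweets-500` as user ID, dataset ID, then dataset name, and the fallback for datasets without a stored `repo_name` are all assumptions. The validation pattern reflects my understanding of LakeFS repository IDs (3 to 63 characters, lowercase letters, digits, and hyphens, starting with a letter or digit) and should be checked against the LakeFS version in use.

```python
import re
import uuid
from typing import Optional

# Assumed LakeFS repository ID rule: 3-63 chars, lowercase letters,
# digits, and hyphens, starting with a letter or digit.
REPO_NAME_PATTERN = re.compile(r"^[a-z0-9][a-z0-9-]{2,62}$")


def uuid_repo_name() -> str:
    """Option 1: an opaque UUID-based name (hypothetical 'dataset-' prefix)."""
    return f"dataset-{uuid.uuid4()}"


def id_based_repo_name(owner_id: int, dataset_id: int, dataset_name: str = "") -> str:
    """Option 2: user ID + dataset ID, optionally followed by a readable slug."""
    # Reduce the dataset name to lowercase letters, digits, and single hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", dataset_name.lower()).strip("-")
    prefix = f"{owner_id}-{dataset_id}"
    name = f"{prefix}-{slug}" if slug else prefix
    # Respect the assumed 63-character limit without leaving a trailing hyphen.
    name = name[:63].rstrip("-")
    if not REPO_NAME_PATTERN.match(name):
        raise ValueError(f"invalid repository name: {name!r}")
    return name


def resolve_repo_name(dataset_name: str, repo_name: Optional[str]) -> str:
    """Existing datasets keep their old repository name (the dataset name).

    Assumes `repo_name` is empty for datasets created before the new column
    existed; a backfill that copies the old name would make this unnecessary.
    """
    return repo_name if repo_name else dataset_name
```

Because the IDs in the second format are unique, two users can both own a dataset called `tweets-500` without their repository names colliding. Note that the slug is fixed at creation time, so it would go stale if dataset names later become editable.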
We can also choose a combination of dataset ID and UUID, or just the dataset ID alone.

## Discussion Items in Meeting

- Alternative formats for the repository name in LakeFS: a UUID, or a combination of user ID/name and dataset ID/name
- Should a user be allowed to create a dataset with a name they have already used themselves? (Google Docs allows this.)
- Should editing the dataset name be allowed?
