GitHub user zyratlo created a discussion: Notebook Migration tool: database schema design
## 1. Context

The Notebook Migration tool is an in-development feature that uses LLM capabilities to migrate user-uploaded Python Jupyter notebooks to Texera workflows. When the user uploads a notebook, the LLM returns the generated workflow and a mapping. The mapping links the notebook to the workflow: it records which notebook cells correspond to which workflow operators and vice versa.

The tool requires two pieces of information to be stored in the Texera database: the uploaded notebook and the mapping, both of which are JSON documents. This discussion focuses on the database schema design for these two pieces of data.

## 2. Current Design

The current design adds two new tables: `workflow_notebook_source` and `workflow_notebook_mapping`. The former stores the notebook and the latter stores the mapping.

### 2a. workflow_notebook_source

This table relates to the existing `workflow` table, uses `wid` as its primary key, and stores the notebook as a JSON binary. Note that this does not need to be a standalone table: `notebook` could instead be merged into the `workflow` table as a nullable column that defaults to null. As a consequence of this design, only one notebook can ever be tied to a workflow (this limitation is discussed later).

### 2b. workflow_notebook_mapping

This table relates to the existing `workflow_version` table and uses `wid` and `vid` together as its primary key. The mapping relates to the `workflow_version` table rather than the `workflow` table because, when the user edits the workflow, we want to generate another mapping between the new workflow and the original notebook. This design allows us to store a mapping for every workflow version.

### 2c. Limitations of this design

As mentioned earlier, this design allows only one notebook to ever be associated with a workflow, effectively assuming that notebooks are immutable after workflow generation.
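To make the one-notebook-per-workflow constraint concrete, here is a minimal sketch of the two tables using Python's built-in `sqlite3` module as a stand-in for the Texera database. The column types, the simplified `workflow` and `workflow_version` tables, and the JSON field names (`cells`, `cell_to_operator`) are assumptions made for illustration only; they are not the actual Texera schema.

```python
import json
import sqlite3

# In-memory SQLite database used purely as a stand-in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Simplified, hypothetical versions of the existing and proposed tables.
conn.executescript("""
CREATE TABLE workflow (wid INTEGER PRIMARY KEY);
CREATE TABLE workflow_version (
    vid INTEGER PRIMARY KEY,
    wid INTEGER NOT NULL REFERENCES workflow(wid)
);
CREATE TABLE workflow_notebook_source (
    wid INTEGER PRIMARY KEY REFERENCES workflow(wid),
    notebook TEXT NOT NULL          -- JSON stored as text in this sketch
);
CREATE TABLE workflow_notebook_mapping (
    wid INTEGER NOT NULL REFERENCES workflow(wid),
    vid INTEGER NOT NULL REFERENCES workflow_version(vid),
    mapping TEXT NOT NULL,          -- JSON stored as text in this sketch
    PRIMARY KEY (wid, vid)
);
""")

# One workflow with two versions and a single uploaded notebook.
conn.execute("INSERT INTO workflow (wid) VALUES (1)")
conn.executemany(
    "INSERT INTO workflow_version (vid, wid) VALUES (?, ?)", [(1, 1), (2, 1)]
)
conn.execute(
    "INSERT INTO workflow_notebook_source (wid, notebook) VALUES (?, ?)",
    (1, json.dumps({"cells": ["import pandas as pd"]})),
)

# A mapping can be stored for every workflow version.
for vid in (1, 2):
    conn.execute(
        "INSERT INTO workflow_notebook_mapping (wid, vid, mapping) VALUES (?, ?, ?)",
        (1, vid, json.dumps({"cell_to_operator": {"0": f"op-v{vid}"}})),
    )

# A second notebook for the same workflow violates the primary key on wid.
try:
    conn.execute(
        "INSERT INTO workflow_notebook_source (wid, notebook) VALUES (?, ?)",
        (1, json.dumps({"cells": ["import numpy as np"]})),
    )
    second_notebook_stored = True
except sqlite3.IntegrityError:
    second_notebook_stored = False

mapping_count = conn.execute(
    "SELECT COUNT(*) FROM workflow_notebook_mapping WHERE wid = 1"
).fetchone()[0]
notebook_count = conn.execute(
    "SELECT COUNT(*) FROM workflow_notebook_source WHERE wid = 1"
).fetchone()[0]
```

In this sketch the workflow ends up with one mapping per version but still only one notebook row, because `wid` alone is the key of `workflow_notebook_source`.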
The immutability assumption is problematic: if the user edits the notebook after generation, the existing mappings no longer apply. We can't store the notebook in the same table as the mapping either, because changing the notebook without changing the workflow does not produce a new `(wid, vid)` key.

## 3. Possible Solutions/Workarounds

A workaround that keeps the current design is to create a new workflow every time the notebook is modified. This way, existing mappings remain valid after the notebook changes. However, this is not desirable because it adds unnecessary complexity for the user, who now has to manage multiple workflows for the same project.

Other solutions would likely require modifying existing tables or adding new ones in order to support multiple notebook versions while keeping mappings valid. I am currently brainstorming better alternative designs.

GitHub link: https://github.com/apache/texera/discussions/4175
