GitHub user zyratlo created a discussion: Notebook Migration tool: database 
schema design

## 1. Context
The Notebook Migration tool is an in-development feature that uses an LLM to 
migrate user-uploaded Python Jupyter notebooks to Texera workflows. When the 
user uploads a notebook, the LLM returns the generated workflow and a mapping. 
The mapping serves as the link between the notebook and the workflow: it 
records which notebook cells correspond to which workflow operators and vice 
versa. This tool requires two pieces of information to be stored in the Texera 
database: the uploaded notebook and the mapping, both of which are JSON 
documents. This discussion focuses on the database schema design for these two 
pieces of data.
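
To make the "and vice versa" part concrete, here is a minimal sketch of what a bidirectional mapping could look like. The JSON shape, the cell IDs, and the operator IDs are all hypothetical; the actual mapping format returned by the LLM is not specified in this discussion.

```python
from collections import defaultdict

# Hypothetical mapping shape: notebook cell ID -> workflow operator IDs.
# These IDs are made up for illustration only.
cell_to_operators = {
    "cell-1": ["op-csv-scan"],
    "cell-2": ["op-filter", "op-projection"],
    "cell-3": ["op-projection"],
}


def invert(mapping):
    """Build the reverse lookup (operator -> cells) from cell -> operators."""
    reverse = defaultdict(list)
    for cell_id, operator_ids in mapping.items():
        for operator_id in operator_ids:
            reverse[operator_id].append(cell_id)
    return dict(reverse)


operator_to_cells = invert(cell_to_operators)
print(operator_to_cells["op-projection"])
```

If only one direction is stored, the other can be derived on read as above, so the stored JSON does not necessarily need to contain both.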

## 2. Current Design
![Screenshot_20260120_142754_Firefox](https://github.com/user-attachments/assets/5053733c-1745-47d5-81a8-dad0dc86f179)
The current design adds two new tables: `workflow_notebook_source` and 
`workflow_notebook_mapping`. The former stores the notebook and the latter 
stores the mapping.

### 2a. workflow_notebook_source
This table references the existing `workflow` table, uses `wid` as its primary 
key, and stores the notebook as binary JSON. Note that it does not need to be a 
standalone table: `notebook` could instead be merged into the `workflow` table 
as a nullable column that defaults to NULL. As a consequence of keying on 
`wid`, only one notebook can ever be tied to a workflow (this limitation is 
discussed in Section 2c).

### 2b. workflow_notebook_mapping
This table references the existing `workflow_version` table and uses 
`(wid, vid)` as its composite primary key. The mapping references 
`workflow_version` rather than `workflow` because when the user edits the 
workflow, we want to generate another mapping between the new workflow version 
and the original notebook. This design lets us store one mapping for every 
workflow version.
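
The two tables described in 2a and 2b can be sketched roughly as follows. This is an illustration under assumptions, not the actual Texera DDL: it uses an in-memory SQLite database as a stand-in, stores JSON as `TEXT` instead of a binary JSON type, reduces `workflow` and `workflow_version` to their key columns, and assumes the mapping column is named `mapping`.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Existing tables, reduced to their keys for this sketch.
CREATE TABLE workflow (wid INTEGER PRIMARY KEY);
CREATE TABLE workflow_version (
    vid INTEGER PRIMARY KEY,
    wid INTEGER NOT NULL REFERENCES workflow(wid),
    UNIQUE (wid, vid)
);
-- 2a: exactly one notebook per workflow, because wid is the primary key.
CREATE TABLE workflow_notebook_source (
    wid INTEGER PRIMARY KEY REFERENCES workflow(wid),
    notebook TEXT NOT NULL
);
-- 2b: one mapping per workflow version, keyed on (wid, vid).
CREATE TABLE workflow_notebook_mapping (
    wid INTEGER NOT NULL,
    vid INTEGER NOT NULL,
    mapping TEXT NOT NULL,  -- column name is an assumption
    PRIMARY KEY (wid, vid),
    FOREIGN KEY (wid, vid) REFERENCES workflow_version(wid, vid)
);
""")

# One workflow with two versions, one notebook, and one mapping per version.
conn.execute("INSERT INTO workflow (wid) VALUES (1)")
conn.executemany(
    "INSERT INTO workflow_version (vid, wid) VALUES (?, ?)", [(10, 1), (11, 1)]
)
conn.execute(
    "INSERT INTO workflow_notebook_source VALUES (?, ?)",
    (1, json.dumps({"cells": []})),
)
conn.executemany(
    "INSERT INTO workflow_notebook_mapping VALUES (?, ?, ?)",
    [(1, 10, json.dumps({})), (1, 11, json.dumps({}))],
)
mapping_count = conn.execute(
    "SELECT COUNT(*) FROM workflow_notebook_mapping WHERE wid = 1"
).fetchone()[0]
```

Both mappings point at the same single notebook row, which is what the design intends as long as the notebook never changes.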

### 2c. Limitations of this design
As mentioned in Section 2a, this design allows only one notebook to ever be 
associated with a workflow, effectively assuming that notebooks are immutable 
after workflow generation. This is problematic: if the user edits the notebook 
after generation, the stored mappings no longer describe it. We also can't 
store the notebook in the same table as the mapping, because changing the 
notebook without changing the workflow does not produce a new `(wid, vid)` key 
to store it under.
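
The one-notebook-per-workflow constraint can be demonstrated with a small sketch. As before, this assumes SQLite as a stand-in database and a `TEXT` column in place of binary JSON; the notebook contents are placeholders.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE workflow_notebook_source ("
    "wid INTEGER PRIMARY KEY, notebook TEXT NOT NULL)"
)

original = json.dumps({"cells": ["original"]})
edited = json.dumps({"cells": ["edited"]})
conn.execute("INSERT INTO workflow_notebook_source VALUES (?, ?)", (1, original))

# A second notebook for the same workflow violates the primary key on wid.
try:
    conn.execute("INSERT INTO workflow_notebook_source VALUES (?, ?)", (1, edited))
    second_insert_rejected = False
except sqlite3.IntegrityError:
    second_insert_rejected = True

# The only alternative is to overwrite in place, which discards the notebook
# that any existing (wid, vid) mappings were generated against.
conn.execute(
    "UPDATE workflow_notebook_source SET notebook = ? WHERE wid = ?", (edited, 1)
)
row_count = conn.execute(
    "SELECT COUNT(*) FROM workflow_notebook_source"
).fetchone()[0]
stored = conn.execute(
    "SELECT notebook FROM workflow_notebook_source WHERE wid = 1"
).fetchone()[0]
```

After the update there is still one row, but it now holds the edited notebook, and nothing in the schema records that earlier mappings referred to the original.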

## 3. Possible Solutions/Workarounds
A workaround that keeps the current design is to create a new workflow every 
time the notebook is modified. This way each workflow's mappings stay valid for 
the notebook it was generated from. However, this is not desirable because it 
adds unnecessary complexity for the user, who now has to manage multiple 
workflows for the same project.

Other solutions would likely require modifying existing tables or adding new 
ones in order to support multiple notebook versions while keeping mappings 
valid. I am currently brainstorming alternative designs.
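
As one illustration of the "adding more tables" direction, the sketch below versions the notebook and includes the notebook version in the mapping key. This is not a design proposed in this discussion: the table `workflow_notebook_version` and the column `nvid` are hypothetical names, and SQLite with `TEXT` columns is again only a stand-in.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- Hypothetical: one row per notebook version of a workflow.
CREATE TABLE workflow_notebook_version (
    wid INTEGER NOT NULL,
    nvid INTEGER NOT NULL,
    notebook TEXT NOT NULL,
    PRIMARY KEY (wid, nvid)
);
-- Hypothetical: a mapping is identified by workflow version AND notebook version.
CREATE TABLE workflow_notebook_mapping (
    wid INTEGER NOT NULL,
    vid INTEGER NOT NULL,
    nvid INTEGER NOT NULL,
    mapping TEXT NOT NULL,
    PRIMARY KEY (wid, vid, nvid),
    FOREIGN KEY (wid, nvid) REFERENCES workflow_notebook_version(wid, nvid)
);
""")

# The notebook is edited once (nvid 1 -> 2) while the workflow stays at vid 10.
conn.executemany(
    "INSERT INTO workflow_notebook_version VALUES (?, ?, ?)",
    [(1, 1, '{"cells": ["original"]}'), (1, 2, '{"cells": ["edited"]}')],
)
conn.executemany(
    "INSERT INTO workflow_notebook_mapping VALUES (?, ?, ?, ?)",
    [(1, 10, 1, "{}"), (1, 10, 2, "{}")],
)
mappings_for_version = conn.execute(
    "SELECT COUNT(*) FROM workflow_notebook_mapping WHERE wid = 1 AND vid = 10"
).fetchone()[0]
```

With a key like this, a notebook edit that leaves the workflow unchanged still gets its own mapping row, at the cost of an extra table and a wider key.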

GitHub link: https://github.com/apache/texera/discussions/4175
