Re: [PR] feat(python-notebook-migration): add LLM client for notebook-to-workflow conversion [texera]

via GitHub Tue, 23 Jun 2026 11:10:08 -0700


zyratlo commented on code in PR #5260:
URL: https://github.com/apache/texera/pull/5260#discussion_r3461913513



##########
frontend/src/app/workspace/service/notebook-migration/migration-prompts.ts:
##########
@@ -0,0 +1,414 @@
+/**
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+
+// TEXERA DOCUMENTATION
+
+// https://github.com/Texera/texera/wiki/Guide-to-Use-a-Python-UDF
+export const TEXERA_OVERVIEW = `
+You are a robust compiler that takes python code and translates it to our 
personal workflow environment Texera that uses python.
+
+  Texera is a data analytics tool that uses workflows to do machine learning 
and data analytics computation. Users are able to drag and drop operators and 
connect their inputs and outputs in a workflow graphical user interface, which 
the code we are going to create.
+
+Texera is able to use Python user defined functions. Documentation of a Python 
UDF in Texera follows:
+  Process Data APIs
+
+There are three APIs to process the data in different units.
+
+  Tuple API.
+
+  class ProcessTupleOperator(UDFOperatorV2):
+
+def process_tuple(self, tuple_: Tuple, port: int) -> Iterator[Optional[TupleLike]]:
+yield tuple_
+
+Tuple API takes one input tuple from a port at a time. It returns an iterator 
of optional TupleLike instances. A TupleLike is any data structure that 
supports key-value pairs, such as pytexera.Tuple, dict, defaultdict, 
NamedTuple, etc.
+
+  Tuple API is useful for implementing functional operations which are applied 
to tuples one by one, such as map, reduce, and filter.
+
+  Table API.
+
+  class ProcessTableOperator(UDFTableOperator):
+
+def process_table(self, table: Table, port: int) -> Iterator[Optional[TableLike]]:
+yield table
+
+Table API consumes a Table at a time, which consists of all the tuples from a 
port. It returns an iterator of optional TableLike instances. A TableLike is a 
collection of TupleLike, and currently, we support pytexera.Table and 
pandas.DataFrame as a TableLike instance. More flexible types will be supported 
down the road.
+
+  Table API is useful for implementing blocking operations that will consume 
all the data from one port, such as join, sort, and machine learning training.
+
+  Batch API.
+
+  class ProcessBatchOperator(UDFBatchOperator):
+
+BATCH_SIZE = 10
+
+def process_batch(self, batch: Batch, port: int) -> Iterator[Optional[BatchLike]]:
+yield batch
+
+Batch API consumes a batch of tuples at a time. Similar to Table, a Batch is 
also a collection of Tuples; however, its size is defined by the BATCH_SIZE, 
and one port can have multiple batches. It returns an iterator of optional 
BatchLike instances. A BatchLike is a collection of TupleLike, and currently, 
we support pytexera.Batch and pandas.DataFrame as a BatchLike instance. More 
flexible types will be supported down the road.
+
+  The Batch API serves as a hybrid API combining the features of both the 
Tuple and Table APIs. It is particularly valuable for striking a balance 
between time and space considerations, offering a trade-off that optimizes 
efficiency.
+
+  All three APIs can return an empty iterator by yield None.
+
+  The template code for a Python UDF follows: MAKE SURE TO USE THE CLASS NAMES 
AND FUNCTIONS DEFINED, THIS IS A MUST FOR THE PROGRAM TO WORK. SELECT 1 OUT OF 
THE 3 PROCESSING OPERATOR FUNCTIONS TO BUILD DEPENDING ON THE CONTEXT OF CODE 
TRANSLATION.
+# Choose from the following templates:
+  #
+# from pytexera import *
+#
+# class ProcessTupleOperator(UDFOperatorV2):
+#
+#     @overrides
+#     def process_tuple(self, tuple_: Tuple, port: int) -> Iterator[Optional[TupleLike]]:
+#         yield tuple_
+#
+# class ProcessBatchOperator(UDFBatchOperator):
+#     BATCH_SIZE = 10 # must be a positive integer
+#
+#     @overrides
+#     def process_batch(self, batch: Batch, port: int) -> Iterator[Optional[BatchLike]]:
+#         yield batch
+#
+# class ProcessTableOperator(UDFTableOperator):
+#
+#     @overrides
+#     def process_table(self, table: Table, port: int) -> Iterator[Optional[TableLike]]:
+#         yield table
+`;
+
+// https://github.com/Texera/texera/blob/1fa249a9d55d4dcad36d93e093c2faed5c4434f0/core/amber/src/main/python/core/models/tuple.py
+export const TUPLE_DOCUMENTATION = `
+### **<code>Tuple</code> Class Overview**
+
+The \`Tuple\` class is a **lazy-evaluated** data structure designed for 
efficient field storage and access. It provides:
+
+  1. **Support for Multiple Data Sources**:
+* Can be initialized from a \`TupleLike\` object, such as a \`pandas.Series\`, 
\`OrderedDict\`, or another \`Tuple\` instance.
+* Works with \`ArrowTableTupleProvider\` to access \`pyarrow.Table\` data.
+2. **Lazy Field Evaluation**:
+* Field values can be either **directly stored values** or **lazy accessors** 
(\`field_accessor\`).
+* If a field is accessed and is an accessor, it is evaluated and cached.
+3. **Schema (<code>Schema</code>) Enforcement**:
+  * A \`Tuple\` can be created without a schema but can be **finalized** with 
one using \`finalize(schema)\`, which:
+* **Casts field values** (e.g., \`NaN → None\`, \`Object → Bytes\`).
+* **Validates field completeness**, ensuring all fields match the \`Schema\`.
+4. **Pythonic Access Patterns**:
+* **Index-based access**: \`tuple["field_name"]\` or \`tuple[index]\` 
retrieves field values.
+* **Dictionary-like operations**: \`tuple.as_dict()\` returns an 
\`OrderedDict\`, and \`tuple.as_series()\` converts to a \`pandas.Series\`.
+* **Iterable support**: \`for field in tuple\` iterates over field values.
+5. **Hashing and Comparisons**:
+* Implements \`__hash__\` using a Java-like hashing algorithm, allowing usage 
as dictionary keys.
+* Implements \`__eq__\`, supporting equality checks based on field contents.
+6. **Partial Data Extraction**:
+* \`tuple.get_partial_tuple(attribute_names)\` returns a new \`Tuple\` 
instance containing only the specified fields.
+`;
+
+// https://github.com/Texera/texera/blob/1fa249a9d55d4dcad36d93e093c2faed5c4434f0/core/amber/src/main/python/core/models/table.py
+export const TABLE_DOCUMENTATION = `### **<code>Table</code> Class Overview**
+
+The \`Table\` class extends \`pandas.DataFrame\`, providing **structured 
Tuple-based data management**. It is designed to integrate seamlessly with 
\`Tuple\` objects.
+
+#### **Key Features:**
+
+1. **Flexible Construction:**
+* Can be initialized from various sources:
+* Another \`Table\` (\`from_table(table)\`)
+* A \`pandas.DataFrame\` (\`from_data_frame(df)\`)
+* A list/iterator of \`TupleLike\` objects (\`from_tuple_likes(tuple_likes)\`)
+* Ensures all \`Tuple\` objects in a \`Table\` have **consistent field names**.
+2. **Tuple Conversion:**
+* \`as_tuples()\`: Converts the table rows into an **iterator of 
<code>Tuple</code> instances**, preserving the row order.
+3. **Equality Comparison (<code>__eq__</code>):**
+* Supports **row-wise equality checks** by comparing the underlying \`Tuple\` 
objects.
+4. **Universal Tuple Output (<code>all_output_to_tuple</code>):**
+* A helper function to convert **various data types** into \`Tuple\` 
iterators, supporting:
+* \`None\` → \`[None]\`
+* \`Table\` → \`as_tuples()\`
+* \`pandas.DataFrame\` → Converted into a \`Table\`, then to Tuples
+* \`List[TupleLike]\` → Converted to \`Tuple\` instances
+* A single \`TupleLike\` or \`Tuple\` → Wrapped in an iterator
+
+#### **Relation to <code>Tuple</code>:**
+
+* \`Table\` **stores multiple <code>Tuple</code> objects** and ensures schema 
consistency across rows.
+* Provides an **efficient bridge** between \`Tuple\`-based data and 
\`pandas.DataFrame\`, enabling compatibility with Python's data analysis tools.
+`;
+
+// https://github.com/Texera/texera/blob/42d803310c180978a9f02992f0e05556796b293c/core/amber/src/main/python/core/models/operator.py
+export const OPERATOR_DOCUMENTATION = `### **Operator Class Overview**
+
+The \`Operator\` class is an **abstract base class (ABC)** for all operators, 
defining the fundamental structure for processing \`Tuple\`, \`Batch\`, and 
\`Table\` data in a workflow.
+
+#### **Key Features & Hierarchy**
+
+1. **Base <code>Operator</code> Class**:
+* Defines lifecycle methods: \`open()\` and \`close()\`.
+* Supports a **source flag (<code>is_source</code>)** to distinguish source 
operators from others.
+2. **Tuple-Based Processing (<code>TupleOperatorV2</code>)**:
+* Processes individual \`Tuple\` objects through \`process_tuple(tuple_, 
port)\`.
+* Calls \`on_finish(port)\` when an input port is exhausted.
+3. **Types of Operators**:
+* **SourceOperator**:
+* Produces data via \`produce()\`, yielding \`TupleLike\` or \`TableLike\` 
objects.
+* Overrides \`on_finish(port)\` to output produced data.
+* **BatchOperator**:
+* Collects tuples into batches (\`BATCH_SIZE\`) before processing via 
\`process_batch(batch, port)\`.
+* Converts processed batches (typically \`pandas.DataFrame\`) into \`Tuple\` 
output.
+* **TableOperator**:
+* Collects tuples into a \`Table\` before processing via 
\`process_table(table, port)\`.
+* Converts processed \`Table\` output back into tuples.
+4. **Data Flow & Processing**:
+* Operators receive data **tuple-by-tuple**, **batch-by-batch**, or 
**table-by-table** depending on the type.
+* Results are **iterators** of transformed data (\`TupleLike\`, \`BatchLike\`, 
or \`TableLike\`).
+5. **Deprecated <code>TupleOperator</code>**:
+* The older version of \`TupleOperator\` is deprecated in favor of 
\`TupleOperatorV2\`.
+
+#### Relation to <code>Tuple</code> and <code>Table</code>
+
+* Operators **consume and transform** \`Tuple\` and \`Table\` data within a 
workflow.
+* **Tuple-based operators** process row-wise, while **Table operators** handle 
structured table transformations.
+* **Source operators** initiate the data flow by generating tuples or tables.`;
+
+export const UDF_INPUT_PORT_DOCUMENTATION = `
+Python UDF operators support multiple input and output ports, allowing a 
single operator to receive different types of data from various upstream 
operators. In the process_tuple(self, tuple_: Tuple, port: int) function in 
ProcessTupleOperator and the process_table(self, table: Table, port: int) 
function in ProcessTableOperator, the port parameter indicates the input port. 
The port numbers are assigned in order, starting from 0 to N, from top to 
bottom. When input data have different schemas, it is necessary to assign them 
to different input ports. However, if all input data share the same schema, 
additional ports are not required. In both ProcessTupleOperator and 
ProcessTableOperator, there is an on_finish(self, port: int) function that is 
executed only after all the tuples from the specified port are processed.
+
+Using this knowledge, for situations where multiple upstream UDFs act as input 
to a single UDF, we can introduce an intermediary UDF that collects all of the 
input data and reformats it into a single table, which is then passed as input 
to the original next downstream UDF. When it is necessary for this to occur in 
your translation from notebook to UDFs, include the intermediary UDF and make 
sure that it and the next operator that uses its output is formatted correctly 
and handles the data transfer properly.
+`;
+
+export const EXAMPLE_OF_GOOD_CONVERSION = `
+Here is an example of python code translated into a compatible Texera UDF that 
gives output that abides the output schema compatible with the Texera workflow 
operators for tuples. Other operators do not always follow this strict format, 
but the yielding output structure is important.
+
+Python Code (high level idea): We have a python code that given some data, we 
limit the number of data.
+
+Texera Operator code:
+from pytexera import *
+
+class ProcessTupleOperator(UDFOperatorV2):
+def __init__(self):
+self.limit = 10
+self.count = 0
+@overrides
+def process_tuple(self, tuple_: Tuple, port: int) -> Iterator[Optional[TupleLike]]:
+if(self.count < self.limit):
+self.count += 1
+yield tuple_
+
+`;
+
+export const VISUALIZER_DOCUMENTATION = `
+Texera requires a unique way of generating visualizations from ML libraries:
+1. Ensures one yield per operator (per Texera’s UDF constraints).
+2. Uses Plotly for visualization and outputs results as embeddable HTML.
+3. Error handling is built-in to notify users when data is missing.
+`;
+
+export const EXAMPLE_OF_MULTIPLE_UDF_CONVERSION = `
+Here is an example of breaking up python code into multiple Texera UDFs. 
Format your response structure exactly like the given example. The "code" key 
contains a dictionary of the UDF ID's with their respective code. The "edges" 
key contains a list of pairs that contains the connections between UDFs. The 
"outputs" key contains a dictionary of the UDF ID's with a list of the output 
column names of the DataFrame that the UDF yields. The UDFs can branch and 
merge, it does not have to be a linear chain depending on your implementation.
+
+Original Code:
+\`\`\`python
+# START CELL1
+import pandas as pd
+from sklearn.model_selection import train_test_split
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.svm import SVC
+from sklearn.tree import DecisionTreeClassifier
+from sklearn.linear_model import LogisticRegression
+from sklearn.metrics import accuracy_score
+from sklearn.preprocessing import StandardScaler
+import matplotlib.pyplot as plt
+# END CELL1
+
+# START CELL2
+# Load the dataset
+file_path = 'diabetes.csv'
+data = pd.read_csv(file_path)
+# END CELL2
+
+# START CELL3
+# Remove duplicate rows
+data = data.drop_duplicates()
+
+# Remove rows with null values
+data = data.dropna()
+# END CELL3
+
+# START CELL4
+# Print the minimum, maximum, and mean for all fields
+print("Minimum values:\n", data.min())
+print("\nMaximum values:\n", data.max())
+print("\nMean values:\n", data.mean())
+# END CELL 4

Review Comment:
   Fixed in 
[4dc3b1b](https://github.com/apache/texera/pull/5260/commits/4dc3b1bc0cbd4ad1b834b89200287299b6430b98)
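
   For readers of this thread: the limit operator quoted in `EXAMPLE_OF_GOOD_CONVERSION` above lost its indentation in the email rendering. Below is a minimal sketch of the same pattern that runs without Texera. It is not the code in the PR: `LimitTupleOperator` and `run` are hypothetical stand-ins, a plain `dict` takes the place of pytexera's `TupleLike`, and the class does not inherit from `UDFOperatorV2`.

```python
from typing import Any, Dict, Iterator, List, Optional

# Assumption: a plain dict stands in for pytexera's TupleLike so the
# sketch can run without Texera installed.
TupleLike = Dict[str, Any]


class LimitTupleOperator:
    """Hypothetical stand-in for ProcessTupleOperator(UDFOperatorV2)."""

    def __init__(self, limit: int = 10) -> None:
        self.limit = limit
        self.count = 0

    def process_tuple(
        self, tuple_: TupleLike, port: int
    ) -> Iterator[Optional[TupleLike]]:
        # Forward the tuple only while fewer than `limit` have been emitted;
        # afterwards the generator yields nothing for the remaining inputs.
        if self.count < self.limit:
            self.count += 1
            yield tuple_


def run(operator: LimitTupleOperator, rows: List[TupleLike]) -> List[TupleLike]:
    """Hypothetical driver: feed rows one at a time on port 0, collect output."""
    out: List[TupleLike] = []
    for row in rows:
        out.extend(t for t in operator.process_tuple(row, port=0) if t is not None)
    return out
```

   Keeping the counter on the instance is what lets the limit persist across calls, since `process_tuple` is invoked once per input tuple.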



