aicam opened a new issue, #4198:
URL: https://github.com/apache/texera/issues/4198

   ### Feature Summary
   
   ## Description
   We propose enabling a standardized experience for users to bring and utilize 
their own Machine Learning (ML) models within the Texera platform. To achieve 
this, we need to adopt a unified protocol for the entire lifecycle of model 
saving, loading, and execution.
   
   After evaluating several standards, we recommend integrating **MLflow** as 
the core protocol for model management in Texera.
   
   ## Motivation & User Persona
   Texera serves two primary user groups with distinct needs:
   1.  **Students**, who use the platform to learn the fundamentals of Machine Learning and Data Science.
   2.  **Biomedical Engineers**, who require heavy computation for tasks such as sequence alignment and "shallow" machine learning (e.g., Scikit-Learn, classic statistical models).
   
   Currently, there is no standardized way for these users to import and run 
pre-trained models seamlessly. Implementing a standard protocol will streamline 
this workflow and enhance Texera's extensibility.
   
   ## Evaluation of Alternatives
   We explored several options before selecting MLflow:
   
   * **Hugging Face:**
       * *Pros:* Excellent standards and ease of use; industry standard for 
LLMs.
       * *Cons:* Primarily focused on LLMs and Deep Learning. It does not offer 
a comprehensive solution for managing the full lifecycle (storage to loading) 
of general-purpose or "shallow" ML models often used by our target audience.
   * **ONNX (Open Neural Network Exchange):**
       * *Pros:* Great interoperability for deep learning models.
       * *Cons:* Heavily focused on Neural Networks, making it less suitable 
for the broad range of general ML libraries (like Scikit-Learn) that our 
biomedical users rely on.
   * **MLflow (Selected):**
       * *Pros:* Supports a wide variety of libraries including TensorFlow, 
PyTorch, and Scikit-Learn. Crucially, it manages the *entire* lifecycle from 
standardizing the storage format to loading the model for inference.
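   As a quick illustration of the lifecycle MLflow covers, the sketch below saves a small scikit-learn classifier in MLflow's standardized on-disk format and loads it back through the flavor-agnostic `pyfunc` interface (the model and data here are purely illustrative):

   ```python
   import os
   import tempfile

   import numpy as np
   import mlflow.pyfunc
   import mlflow.sklearn
   from sklearn.linear_model import LogisticRegression

   # Train a tiny "shallow" model of the kind our biomedical users rely on.
   X = np.array([[0.0], [1.0], [2.0], [3.0]])
   y = np.array([0, 0, 1, 1])
   model = LogisticRegression().fit(X, y)

   with tempfile.TemporaryDirectory() as tmp:
       model_path = os.path.join(tmp, "model")
       # Save in MLflow's standardized layout: an MLmodel metadata file
       # plus the serialized estimator.
       mlflow.sklearn.save_model(model, model_path)
       # Load it back through the flavor-agnostic pyfunc interface, which
       # works the same way for TensorFlow, PyTorch, etc.
       loaded = mlflow.pyfunc.load_model(model_path)
       preds = loaded.predict(X)
   ```

   The same `pyfunc` loading path would be what the proposed operator uses, regardless of which library produced the model.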
   
   
   
   ### Proposed Solution or Design
   
   ## Proposed Implementation
   The integration will leverage two existing architectural features within 
Texera:
   
   ### 1. Model Storage (via LakeFS)
   * We will utilize our existing **LakeFS** integration to store MLflow 
artifacts.
   * Models will be stored similarly to how we handle datasets, but with a key 
difference: we will enforce the MLflow protocol/structure on the files during 
upload to ensure compatibility.
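   A minimal sketch of what "enforcing the MLflow structure during upload" could look like (the function name and check below are assumptions for illustration, not existing Texera APIs): every MLflow model directory contains an `MLmodel` metadata file declaring at least one flavor, so the upload path can reject directories that lack one before committing to LakeFS.

   ```python
   import os
   import tempfile


   def looks_like_mlflow_model(model_dir: str) -> bool:
       """Hypothetical upload-time check: a valid MLflow model directory
       must contain an MLmodel metadata file declaring at least one flavor."""
       mlmodel_path = os.path.join(model_dir, "MLmodel")
       if not os.path.isfile(mlmodel_path):
           return False
       with open(mlmodel_path) as f:
           return "flavors:" in f.read()


   # Demo: a directory with a minimal MLmodel file passes; an empty one fails.
   with tempfile.TemporaryDirectory() as tmp:
       ok_dir = os.path.join(tmp, "good")
       os.mkdir(ok_dir)
       with open(os.path.join(ok_dir, "MLmodel"), "w") as f:
           f.write("flavors:\n  python_function:\n    loader_module: mlflow.sklearn\n")
       bad_dir = os.path.join(tmp, "bad")
       os.mkdir(bad_dir)
       results = (looks_like_mlflow_model(ok_dir), looks_like_mlflow_model(bad_dir))
   ```

   A production check would likely parse the YAML and validate the declared flavors, but the gating idea is the same.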
   
   ### 2. Model Execution (New Operator)
   * We will introduce a new operator type: `MLflow`.
   * This will be built upon our existing **Python Native Operator** 
infrastructure.
   * The operator will automatically handle loading the model using the 
standard `mlflow` library and executing inference against the input data stream.
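   A rough sketch of how such an operator might look (class and method names are illustrative and do not reflect Texera's actual Python operator API; the `loader` parameter is a hypothetical injection point, shown so the example is self-contained):

   ```python
   from typing import Any, Callable, Dict, Iterator, Optional

   import pandas as pd


   class MLflowInferenceOperator:
       """Illustrative operator: loads an MLflow model once at startup,
       then appends a `prediction` field to every incoming tuple."""

       def __init__(self, model_uri: str,
                    loader: Optional[Callable[[str], Any]] = None):
           if loader is None:
               # In the real operator this would be the standard MLflow call.
               import mlflow.pyfunc
               loader = mlflow.pyfunc.load_model
           self.model = loader(model_uri)

       def process_tuple(self, tuple_: Dict[str, Any]) -> Iterator[Dict[str, Any]]:
           # pyfunc models accept DataFrame input; wrap the single tuple.
           row = pd.DataFrame([tuple_])
           prediction = self.model.predict(row)
           yield {**tuple_, "prediction": prediction[0]}
   ```

   Batching several tuples into one `predict` call would likely be worthwhile for throughput, but per-tuple processing matches the existing Python Native Operator model most directly.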
   
   
![Image](https://github.com/user-attachments/assets/f7859e35-d7aa-4f44-bcc7-17da486494a6)
   
   
![Image](https://github.com/user-attachments/assets/06cf1a88-4d9a-4502-ba22-abe90eec3468)
   
   ### Impact / Priority
   
   (P2) Medium – useful enhancement
   
   ### Affected Area
   
   Workflow Engine (Amber)

