Let us add some more detail to the Data Flow Diagram (DFD) for the entire
pipeline, as attached below.
Mich Talebzadeh,
Architect | Data Science | Financial Crime | Forensic Analysis | GDPR
PhD <https://en.wikipedia.org/wiki/Doctor_of_Philosophy> Imperial College
London <https://en.wikipedia.org/wiki/Imperial_College_London>
London, United Kingdom
view my Linkedin profile
<https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
*Disclaimer:* The information provided is correct to the best of my
knowledge but of course cannot be guaranteed. It is essential to note
that, as with any advice, "one test result is worth one thousand
expert opinions" (Wernher von Braun
<https://en.wikipedia.org/wiki/Wernher_von_Braun>).
On Fri, 3 Jan 2025 at 21:47, Mich Talebzadeh <[email protected]>
wrote:
> Well, I can give you some advice using Google Cloud tools.
>
> Level 0: High-Level Overview
>
> 1. Input: Raw data in Google Cloud Storage (GCS).
> 2. Processing:
>    - Pre-processing with Dataproc (Spark on tin boxes).
>    - Inference with an LLM (Cloud Run/Vertex AI).
>    - Post-processing with Dataproc (Spark).
> 3. Output: Final processed dataset stored in GCS or the Google BigQuery DW.
>
> Level 1: Detailed Data Flow
>
> 1. *Step 1: Pre-Processing* (a minimal PySpark sketch follows this list)
>    - Input: Raw data from GCS.
>    - Process: Transform raw data using Spark on *Dataproc*.
>    - Output: Pre-processed data stored back in *GCS*.
>
> 2. *Step 2: LLM Inference*
>    - Input: Pre-processed data from GCS.
>    - Process:
>      - Pre-processed data is sent in batches to the *LLM Inference Service*
>        hosted on *Cloud Run/Vertex AI*.
>      - The LLM generates inferences for each batch.
>    - Output: LLM-inferred results stored in *GCS*.
>
> 3. *Step 3: Post-Processing*
>    - Input: LLM-inferred results from *GCS*.
>    - Process: Additional transformations, aggregations, or merging with
>      other datasets using Spark on *Dataproc*.
>    - Output: Final dataset stored in *GCS* or loaded into the *Google
>      BigQuery DW* for downstream ML training.
>
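> As a rough illustration, here is a minimal PySpark sketch of the
> pre-processing step (Step 1). The bucket paths, column names and
> transformations are hypothetical placeholders, not a prescription:
>
> # pre_process.py - sketch of Step 1 (all paths and columns are placeholders)
> from pyspark.sql import SparkSession
> from pyspark.sql import functions as F
>
> spark = SparkSession.builder.appName("pre-processing").getOrCreate()
>
> # Read raw data from GCS (format and path are assumptions)
> raw_df = spark.read.json("gs://my-raw-bucket/input/")
>
> # Example clean-up: drop empty records, trim text, tag a batch id for the LLM stage
> pre_df = (raw_df
>           .filter(F.col("text").isNotNull())
>           .withColumn("text", F.trim(F.col("text")))
>           .withColumn("batch_id", F.monotonically_increasing_id() % 1000))
>
> # Write pre-processed data back to GCS as Parquet
> pre_df.write.mode("overwrite").parquet("gs://my-staging-bucket/pre_processed/")
>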
> *Orchestration*
>
> Use *Cloud Composer*, which sits on top of *Apache Airflow*, or just Airflow
> itself.
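>
> For illustration only, a skeleton Airflow DAG (runnable on Cloud Composer)
> could chain the three stages using the Dataproc operator from the Google
> provider package. The project, region, cluster and GCS paths below are
> hypothetical placeholders:
>
> # llm_pipeline_dag.py - skeleton DAG; all names and paths are placeholders
> from datetime import datetime
> from airflow import DAG
> from airflow.providers.google.cloud.operators.dataproc import DataprocSubmitJobOperator
>
> PROJECT, REGION, CLUSTER = "my-project", "europe-west2", "my-dataproc-cluster"
>
> def pyspark_job(main_uri):
>     # Dataproc Job resource for a PySpark script stored in GCS
>     return {
>         "reference": {"project_id": PROJECT},
>         "placement": {"cluster_name": CLUSTER},
>         "pyspark_job": {"main_python_file_uri": main_uri},
>     }
>
> with DAG("llm_pipeline", start_date=datetime(2025, 1, 1),
>          schedule_interval=None, catchup=False) as dag:
>
>     pre_process = DataprocSubmitJobOperator(
>         task_id="pre_process", project_id=PROJECT, region=REGION,
>         job=pyspark_job("gs://my-code-bucket/pre_process.py"))
>
>     llm_inference = DataprocSubmitJobOperator(
>         task_id="llm_inference", project_id=PROJECT, region=REGION,
>         job=pyspark_job("gs://my-code-bucket/llm_inference.py"))
>
>     post_process = DataprocSubmitJobOperator(
>         task_id="post_process", project_id=PROJECT, region=REGION,
>         job=pyspark_job("gs://my-code-bucket/post_process.py"))
>
>     pre_process >> llm_inference >> post_process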
>
> *Monitoring*
>
> - Job performance -> Dataproc
> - LLM API throughput -> Cloud Run/Vertex AI
> - Storage and data transfer metrics -> GCS
> - Logs -> Google Cloud Logging
>
> *Notes*
> The LLM-inferred results are the predictions, insights, or
> transformations produced by the LLM on the input data. These results are the
> outputs of the model's reasoning, natural language understanding, or
> processing capabilities applied to the input.
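>
> As a concrete (hypothetical) example, each LLM-inferred record landing in GCS
> could carry the original input alongside the model output, e.g. with a schema
> along these lines:
>
> # Hypothetical schema for the LLM-inferred results stored in GCS
> from pyspark.sql.types import StructType, StructField, StringType, TimestampType
>
> llm_result_schema = StructType([
>     StructField("record_id", StringType(), False),     # key of the input record
>     StructField("input_text", StringType(), True),     # pre-processed input
>     StructField("llm_output", StringType(), True),     # prediction/insight from the LLM
>     StructField("model_version", StringType(), True),  # which model produced it
>     StructField("inferred_at", TimestampType(), True),
> ])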
>
> HTH
>
> Mich Talebzadeh,
>
> On Fri, 3 Jan 2025 at 13:08, Mayur Dattatray Bhosale <[email protected]>
> wrote:
>
>> Hi team,
>>
>> We are planning to use Spark for pre-processing the ML training data
>> given the data is 500+ TBs.
>>
>> One of the steps in the data pre-processing requires us to use an LLM (our
>> own deployment of the model). I wanted to understand the right way to
>> architect this. These are the options that I can think of:
>>
>> - Split this into multiple applications at the LLM step. Use a workflow
>> manager to feed the output of application 1 to the LLM and the output of
>> the LLM to application 2.
>> - Split this into multiple stages by writing orchestration code that feeds
>> the output of the pre-LLM processing stages to an externally hosted LLM
>> and vice versa.
>>
>> I wanted to know whether there is an easier way to do this within Spark, or
>> whether there are any plans to make such functionality a first-class citizen
>> of Spark in the future. Also, please suggest any better alternatives.
>>
>> Thanks,
>> Mayur
>>
>
"DFD for entire pipeline"
[Raw Data in GCS] --> [Pre-Processing (Dataproc)] --> [Pre-Processed Data in
GCS] -->
--> [LLM Inference Service] --> [LLM Results in GCS] --> [Post-Processing
(Dataproc)] -->
--> [Final Dataset in GCS/BigQuery]
1) High-level DFD

[Raw Data in GCS]
    |
    V
[Pre-Processing (Dataproc)]
    - Reads raw data from GCS
    - Filters, aggregates, and formats data
    |
    V
[Pre-Processed Data in GCS]

2) Pre-Processing Stage DFD

- Data Sources: Raw data files stored in GCS.
- Processes: Spark-based transformations.
- Output: Pre-processed data in GCS.

[Pre-Processed Data in GCS]
    |
    V
3) LLM Inference Stage DFD

[LLM Inference Service]
    - Batch data read from GCS
    - Sends data to LLM for inference
    - Receives inference results
    |
    V
[LLM Results in GCS]
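
A minimal sketch of how this inference stage could be driven from Spark,
assuming the LLM is exposed as an HTTP endpoint on Cloud Run/Vertex AI. The
endpoint URL, payload shape and column names are hypothetical:

# llm_inference.py - sketch of batched calls to a hypothetical LLM HTTP endpoint
import requests
from pyspark.sql import SparkSession

LLM_ENDPOINT = "https://my-llm-service-xyz.a.run.app/infer"  # placeholder URL

def infer_batches(batches):
    # mapInPandas hands us an iterator of pandas DataFrames (one per Arrow batch)
    for pdf in batches:
        resp = requests.post(LLM_ENDPOINT,
                             json={"texts": pdf["text"].tolist()}, timeout=300)
        resp.raise_for_status()
        pdf["llm_output"] = resp.json()["outputs"]  # assumed response shape
        yield pdf

spark = SparkSession.builder.appName("llm-inference").getOrCreate()

pre_df = spark.read.parquet("gs://my-staging-bucket/pre_processed/")

results = pre_df.select("record_id", "text").mapInPandas(
    infer_batches, schema="record_id string, text string, llm_output string")

results.write.mode("overwrite").parquet("gs://my-staging-bucket/llm_results/")

Throttling, retries and batch sizing would need care at this scale; the point
is only that the inference call can live inside an ordinary Spark stage.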
4) Post-Processing Stage DFD

[LLM Results in GCS]
    |
    V
[Post-Processing (Dataproc)]
    - Reads inference results
    - Merges with other datasets or performs additional transformations
    |
    V
[Final Dataset in GCS/BigQuery]
    - Data Sources: LLM results stored in GCS.
    - Processes: Spark-based processing for final dataset preparation.
    - Output: Final dataset in GCS or BigQuery.
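
If the final dataset goes to BigQuery rather than GCS, the Spark-BigQuery
connector (available on Dataproc) can write it directly. The table and bucket
names below are placeholders:

# Sketch: load the final dataset into BigQuery via the Spark-BigQuery connector
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bq-load").getOrCreate()

final_df = spark.read.parquet("gs://my-staging-bucket/final_dataset/")

(final_df.write
    .format("bigquery")
    .option("table", "my_project.ml_dw.final_training_data")  # placeholder table
    .option("temporaryGcsBucket", "my-temp-bucket")           # staging bucket for the load
    .mode("overwrite")
    .save())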
5) Additional Details

Interactions with GCS:
- Data is read from and written to GCS at multiple stages to ensure
  scalability and persistence.
- Each processing stage works on batches of data, leveraging partitioning and
  optimized file formats like Parquet (see the sketch below).

Parallelization with Spark:
- Spark parallelizes both pre-processing and post-processing to handle the
  500+ TB dataset efficiently.

LLM Service:
- Hosted on Cloud Run/Vertex AI to scale horizontally.
- Accepts batches of data for inference and processes asynchronously.
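
On the partitioning point above: at 500+ TB it usually pays to control the
file layout and the shuffle parallelism explicitly. A small sketch, with a
hypothetical date-based partition column and a partition count that would need
tuning to the cluster:

# Sketch: partition-aware Parquet writes for a large dataset
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("partitioned-write").getOrCreate()

df = spark.read.parquet("gs://my-staging-bucket/pre_processed/")

# Derive a partition column (e.g. ingestion date) and repartition before writing
(df.withColumn("ingest_date", F.to_date(F.col("ingested_at")))
   .repartition(2000, "ingest_date")        # tune to data volume and cluster size
   .write.mode("overwrite")
   .partitionBy("ingest_date")
   .parquet("gs://my-staging-bucket/partitioned/"))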
---------------------------------------------------------------------
To unsubscribe e-mail: [email protected]