mohamedawnallah commented on code in PR #36301:
URL: https://github.com/apache/beam/pull/36301#discussion_r2383484852

##########
website/www/site/content/en/blog/gsoc-25-ml-connectors.md:
##########
@@ -0,0 +1,255 @@
+---
+title: "Google Summer of Code 2025 - Beam ML Vector DB/Feature Store
+integrations"
+date: 2025-09-26 00:00:00 -0400
+categories:
+  - blog
+  - gsoc
+aliases:
+  - /blog/2025/09/26/gsoc-25-ml-connectors.html
+authors:
+  - mohamedawnallah
+
+---
+<!--
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+## What Will I Cover In This Blog Post?
+
+I have three objectives in mind when writing this blog post:
+
+- Documenting the work I've been doing during this GSoC period in collaboration
+with the Apache Beam community
+- A thoughtful and cumulative thank you to my mentor and the Beam Community
+- Writing to an older version of myself before making my first ever contribution
+to Beam. This can be helpful for future contributors
+
+## What Was This GSoC Project About?
+
+The goal of this project is to enhance Beam's Python SDK by developing
+connectors for vector databases like Milvus and feature stores like Tecton. These
+integrations will improve support for ML use cases such as Retrieval-Augmented
+Generation (RAG) and feature engineering. By bridging Beam with these systems,
+this project will attract more users, particularly in the ML community.
+
+## Why Was This Project Important?
+
+While Beam's Python SDK supports some vector databases, feature stores and
+embedding generators, the current integrations are limited to a few systems as
+mentioned in the tables down below. Expanding this ecosystem will provide more
+flexibility and richness for ML workflows particularly in feature engineering
+and RAG applications, potentially attracting more users, particularly in the ML
+community.
+
+| Vector Database | Feature Store | Embedding Generator |
+|----------------|---------------|---------------------|
+| BigQuery | Vertex AI | Vertex AI |
+| AlloyDB | Feast | Hugging Face |
+
+## Why Did I Choose Beam As Part of GSoC Among 180+ Orgs?
+
+I choose to apply to Beam from among 180+ GSoC organizations because it
+aligns well with my passion for data processing systems that serve information
+retrieval systems and my core career values:
+
+- **Freedom:** Working on Beam supports open-source development, liberating
+developers from vendor lock-in through its unified programming model while
+enabling services like
+[Project Shield](https://projectshield.withgoogle.com/landing) to protect free
+speech globally
+
+- **Innovation:** Working on Beam allows engagement with cutting-edge data
+processing techniques and distributed computing paradigms
+
+- **Accessibility:** Working on Beam helps build open-source technology that
+makes powerful data processing capabilities available to all organizations
+regardless of size or resources. This accessibility enables projects like
+Project Shield to provide free protection to media, elections, and human rights
+websites worldwide
+
+## What Did I Work On During the GSoC Program?
+
+During my GSoC program, I focused on developing connectors for vector databases,
+feature stores, and embedding generators to enhance Beam's ML capabilities.
+Here are the artifacts I worked on and what remains to be done:
+
+| Type | System | Artifact |
+|----------------|--------|----------|
+| Enrichment Handler | Milvus | [PR #35216](https://github.com/apache/beam/pull/35216) <br> [PR #35577](https://github.com/apache/beam/pull/35577) <br> [PR #35467](https://github.com/apache/beam/pull/35467) |
+| Sink I/O | Milvus | [PR #35708](https://github.com/apache/beam/pull/35708) <br> [PR #35944](https://github.com/apache/beam/pull/35944) |
+| Enrichment Handler | Tecton | [PR #36062](https://github.com/apache/beam/pull/36062) |
+| Sink I/O | Tecton | [PR #36078](https://github.com/apache/beam/pull/36078) |
+| Embedding Gen | OpenAI | [PR #36081](https://github.com/apache/beam/pull/36081) |
+| Embedding Gen | Anthropic | To Be Added |
+
+Here are side-artifacts that are not directly linked to my project:
+| Type | System | Artifact |
+|------|--------|----------|
+| AI Code Review | Gemini Code Assist | [PR #35532](https://github.com/apache/beam/pull/35532) |
+| Enrichment Handler | CloudSQL | [PR #34398](https://github.com/apache/beam/pull/34398) |
+| Sink I/O | CloudSQL | [PR #35473](https://github.com/apache/beam/pull/35473) |
+| Test Infrastructure | GitHub CI | [PR #35655](https://github.com/apache/beam/pull/35655) <br> [PR #35740](https://github.com/apache/beam/pull/35740) <br> [PR #35816](https://github.com/apache/beam/pull/35816) |
+
+For more granular contributors, checking out my

Review Comment:
   > contributions

   Addressed
