This is an automated email from the ASF dual-hosted git repository.
xqhu pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/beam.git
The following commit(s) were added to refs/heads/master by this push:
new 37bb1ed58ba Added beam summit 2025 hackathon blog - Pcollectors
(#35541)
37bb1ed58ba is described below
commit 37bb1ed58ba9f3fd30614442a2adb93363cf8e41
Author: Aditya Shukla <[email protected]>
AuthorDate: Wed Jul 9 20:24:03 2025 +0530
Added beam summit 2025 hackathon blog - Pcollectors (#35541)
* Added beam summit 2025 hackathon blog - Pcollectors
* Fix frontmatter: add missing categories field
* Fix
* Fix
* Fix
* Fix formatting errors
* Fix formatting errors
* Fix formatting errors
* Fix formatting errors
* Fix formatting errors
---
.../beam-summit-2025-hackathon-pcollectors-blog.md | 88 ++++++++++++++++++++++
website/www/site/data/authors.yml | 8 ++
2 files changed, 96 insertions(+)
diff --git
a/website/www/site/content/en/blog/beam-summit-2025-hackathon-pcollectors-blog.md
b/website/www/site/content/en/blog/beam-summit-2025-hackathon-pcollectors-blog.md
new file mode 100644
index 00000000000..134bae1b74a
--- /dev/null
+++
b/website/www/site/content/en/blog/beam-summit-2025-hackathon-pcollectors-blog.md
@@ -0,0 +1,88 @@
+---
+title: "Our Experience at Beam College 2025: 1st Place Hackathon Winners"
+date: 2025-07-08
+authors:
+ - ashukla
+ - dkanade
+categories:
+ - blog
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one or more
+contributor license agreements. See the NOTICE file distributed with
+this work for additional information regarding copyright ownership.
+The ASF licenses this file to You under the Apache License, Version 2.0
+(the "License"); you may not use this file except in compliance with
+the License. You may obtain a copy of the License at
+http://www.apache.org/licenses/LICENSE-2.0
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+-->
+
+## Introduction: The Beam of an Idea
+In the world of machine learning for healthcare, preprocessing large pathology
image datasets at scale remains a bottleneck. Whole Slide Images (WSIs) in
medical imaging can reach massive sizes. Traditional Python tools (PIL, etc.)
fail under memory pressure, especially when handling thousands of such
high-resolution images. This becomes a bottleneck for ML modeling tasks using
standard tools.
+
+Having previously worked on image processing for object detection in machine
learning, we also understood how crucial it is to preprocess and structure
image data correctly for downstream tasks. These challenges are non-trivial and
even more critical in healthcare, making it a natural and high-impact use case
for scalable data processing frameworks like Apache Beam.
+
+So, in the [Beam Summit 2025 Hackathon](https://beamcollege.dev/hackathon/),
we joined as team "PCollectors" with the goal to leverage Beam to process large
image data and convert it to a format suitable for downstream ML tasks. We were
amazed to know that we secured 1st place with the implemented solution!
+
+## The Project: Scalable WSI Preprocessing Beam Pipeline
+[GitHub Repo](https://github.com/adityashukla8/medical_image_processing_beam)
+
+### The Goal
+The primary objective of the pipeline was to process patient data (CSV) &
WSIs, extract embeddings, combine the metadata, and output the final dataset in
TFRecord format, ready for large-scale ML training.
+
+### Solution Overview
+Our pipeline processes:
+
+- Patient metadata (CSV)
+- WSI files (.tif)
+- Split the images into “tiles”
+- Extract filtered image tiles based on the background threshold
+- Generate max & avg embeddings per patient using EfficientNet
+- Merge metadata + embeddings into TFRecords
+
+All in a scalable, memory-efficient, cloud-native pipeline using Apache Beam
and Dataflow.
+
+### Dataset
+Source: Mayo Clinic STRIP AI Dataset (Kaggle)
+Metadata: Each row = { image_id, center_id, patient_id, image_num, label }
+Multiple images per patient
+Labels exist only at the patient level
+Images:
+High-res .tif pathology slides
+
+### Tech Stack
+- Apache Beam: Orchestration engine
+- Google Cloud Dataflow: Scalable runner
+- Google Cloud Storage: Input TIFFs + output TFRecords
+- TensorFlow: For embedding generation (EfficientNet) and TFRecord
serialization
+
+## The Hackathon Journey
+Participating in the hackathon introduced us to multiple new things and
allowed us to learn and implement simultaneously. Through the hackathon
weekend, we:
+
+- Designed the end-to-end pipeline
+- Integrated pyvips + openslide for efficient image loading
+- Used Beam's RunInference API with TensorFlow
+- Tiled and filtered images
+- Wrote patient-level embeddings to TFRecords
+
+## What we Learnt
+Apache Beam is really powerful for parallel and cloud-native ML preprocessing.
+Dataflow is the go-to tool when processing large data, like medical images
+
+## What’s Next for The Project
+Looking ahead, the pipeline can be extended beyond fixed-size tiling by
incorporating image segmentation techniques to generate more meaningful patches
based on tissue regions. This approach can improve ML model performance by
focusing only on relevant areas. Moreover, the same preprocessing framework can
be adapted for video data, where frames can be treated as time-indexed image
slices, effectively enabling temporal modeling for time-series tasks such as
motion analysis or progression [...]
+
+**Project Submission Demo**: [Beam Demo -
PCollectors.mp4](https://drive.google.com/file/d/1Os5SvgqHiqfMkoCWOuaVvEPXsnhqXlLx/view?usp=sharing)
+
+## Conclusion
+We are ML Engineers, working at [Intuitive.Cloud](www.intuitive.cloud), where
we play around with large-scale data to build scalable, efficient, dynamic data
processing pipelines that prepare it for downstream ML tasks, with Apache Beam
and Google Cloud DataFlow being the central pieces.
+
+Participating in the hackathon was a great learning opportunity, huge thanks
to the organizers, mentors, and the Apache Beam community!
+
+\- [Aditya Shukla](https://www.linkedin.com/in/adityashukla8/) & [Darshan
Kanade](https://in.linkedin.com/in/darshan-kanade-0797851b3)
+
diff --git a/website/www/site/data/authors.yml
b/website/www/site/data/authors.yml
index 592c1fe800e..543c70974b4 100644
--- a/website/www/site/data/authors.yml
+++ b/website/www/site/data/authors.yml
@@ -118,6 +118,14 @@ msugar:
name: Marcio Sugar
email: [email protected]
twitter:
+ashukla:
+ name: Aditya Shukla
+ email: [email protected]
+ twitter:
+dkanade:
+ name: Darshan Kanade
+ email: [email protected]
+ twitter:
ardagan:
name: Mikhail Gryzykhin
email: [email protected]