Hi Team,

We are reading about 5k files and executing the pipeline using the Spark runner.

Each file is read as its own pipeline branch, followed by various validations and
transformations. All of these per-file PCollections are then aggregated using a
PCollectionList, followed by a Flatten operation, to create a single PCollection.

After that, various transformations are performed on this single PCollection.
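For context, here is a minimal sketch of the pipeline shape (FileValidations and
FileTransforms are placeholders for our actual transforms, and TextIO reads are
assumed):

    import org.apache.beam.sdk.Pipeline;
    import org.apache.beam.sdk.io.TextIO;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.PCollectionList;

    Pipeline p = Pipeline.create(options);

    PCollectionList<String> perFile = PCollectionList.empty(p);
    for (String file : files) {  // ~5k files
      PCollection<String> records =
          p.apply("Read " + file, TextIO.read().from(file))
           .apply("Validate " + file, new FileValidations())   // placeholder transform
           .apply("Transform " + file, new FileTransforms());  // placeholder transform
      perFile = perFile.and(records);
    }

    // Merge all per-file branches into a single PCollection.
    PCollection<String> merged = perFile.apply("FlattenAll", Flatten.pCollections());

    // Downstream transformations are then applied to `merged`.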

The Spark UI shows about 100,000 (100K) tasks, with very little data per task.

Please suggest whether there is any way to decrease the number of
bundles/partitions, similar to RDD.coalesce() in Spark.
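
For reference, this is the behavior we are after on the Spark side (a minimal
sketch; the target partition count is just illustrative):

    import org.apache.spark.api.java.JavaRDD;

    // coalesce() merges existing partitions down to a smaller count
    // without a full shuffle, so each task gets a reasonable amount of data.
    JavaRDD<String> coalesced = inputRdd.coalesce(200);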

This would give us a significant performance improvement.
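
The closest workaround we could come up with is forcing a redistribution through
a key-bucketing GroupByKey, roughly like the sketch below (merged is the
flattened PCollection from above; numBuckets is an illustrative target). Is
there anything more direct?

    import java.util.concurrent.ThreadLocalRandom;
    import org.apache.beam.sdk.transforms.Flatten;
    import org.apache.beam.sdk.transforms.GroupByKey;
    import org.apache.beam.sdk.transforms.Values;
    import org.apache.beam.sdk.transforms.WithKeys;
    import org.apache.beam.sdk.values.PCollection;
    import org.apache.beam.sdk.values.TypeDescriptors;

    final int numBuckets = 200;  // illustrative target partition count

    PCollection<String> repartitioned =
        merged
            // Assign each element to one of numBuckets random keys.
            .apply("AssignBucket",
                WithKeys.of((String r) -> ThreadLocalRandom.current().nextInt(numBuckets))
                        .withKeyType(TypeDescriptors.integers()))
            // The shuffle gathers the data into ~numBuckets groups.
            .apply(GroupByKey.create())
            // Drop the synthetic keys and flatten the grouped iterables
            // back into individual elements.
            .apply(Values.create())
            .apply(Flatten.iterables());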

Thanks & Regards

Siddharth Mittal
Senior Associate | Sapient
Gurgaon SEZ | India
