Hi Team,

We read about 5,000 files and execute the pipeline using the Spark runner. Each file is read as a separate branch of the pipeline, followed by various validations and transformations on that branch. All of these branches are then aggregated using a PCollectionList followed by a Flatten operation to create a single PCollection, and further transformations are performed on that single PCollection. A rough sketch of the structure is shown below.
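For reference, here is a minimal sketch of that shape in the Beam Java SDK. The file paths, the inputFiles() helper, and the transform names are placeholders, and the per-file validations are elided:

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.Flatten;
import org.apache.beam.sdk.values.PCollection;
import org.apache.beam.sdk.values.PCollectionList;

public class PerFileFlattenSketch {

  // Hypothetical stand-in for however the ~5,000 input paths are discovered.
  static List<String> inputFiles() {
    return Arrays.asList("/data/in/file-0001.txt", "/data/in/file-0002.txt");
  }

  public static void main(String[] args) {
    Pipeline p = Pipeline.create();

    // One read (and one set of validations/transformations) per file.
    List<PCollection<String>> perFile = new ArrayList<>();
    for (String file : inputFiles()) {
      PCollection<String> lines = p.apply("Read " + file, TextIO.read().from(file));
      // ... per-file validations and transformations go here ...
      perFile.add(lines);
    }

    // Aggregate all branches into a single PCollection with Flatten.
    PCollection<String> all =
        PCollectionList.of(perFile).apply("FlattenAll", Flatten.pCollections());

    // ... further transformations on the single PCollection ...

    p.run().waitUntilFinish();
  }
}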
With this structure, the Spark UI shows about 100,000 tasks, each processing a very small amount of data. Please suggest whether there is any way to decrease the number of bundles/partitions, similar to coalesce on an RDD in Spark. That would give us a significant performance improvement.

Thanks & Regards,
Siddharth Mittal
Senior Associate | Sapient
Gurgaon SEZ | India
