I think this is actually the same problem as the one I reported with PubsubIO [1], but in the bounded case. The BoundedSourceAsSDFWrapper closes (and then re-creates) the underlying source each time it checkpoints, and the default behavior is to checkpoint very frequently.
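To make the effect concrete, here is a toy model in plain Java. It is not Beam code: `CheckpointSketch`, `CountingReader`, and `readersCreated` are hypothetical names, and I am assuming a checkpoint is taken after a fixed number of records, which may not match what the wrapper or the runner actually uses as a trigger. It only shows how closing and re-creating on every checkpoint multiplies the number of instances built for the same amount of data.

```java
public class CheckpointSketch {

    /** Hypothetical stand-in for a reader over a bounded source; counts how often one is built. */
    static final class CountingReader {
        static int opened = 0;
        private long position;
        private final long end;

        CountingReader(long start, long end) {
            opened++;
            this.position = start;
            this.end = end;
        }

        /** Moves to the next record; returns false once the source is exhausted. */
        boolean advance() {
            if (position >= end) {
                return false;
            }
            position++;
            return true;
        }

        long position() {
            return position;
        }
    }

    /**
     * Reads totalRecords records, checkpointing after every recordsPerCheckpoint records
     * (an assumed trigger). Each checkpoint drops the current reader and the next
     * iteration builds a new one at the resume position.
     */
    static int readersCreated(long totalRecords, long recordsPerCheckpoint) {
        if (recordsPerCheckpoint <= 0) {
            throw new IllegalArgumentException("recordsPerCheckpoint must be positive");
        }
        CountingReader.opened = 0;
        long resumeFrom = 0;
        while (resumeFrom < totalRecords) {
            CountingReader reader = new CountingReader(resumeFrom, totalRecords);
            long read = 0;
            while (read < recordsPerCheckpoint && reader.advance()) {
                read++;
            }
            // Checkpoint: remember where to resume; the reader is discarded here.
            resumeFrom = reader.position();
        }
        return CountingReader.opened;
    }

    public static void main(String[] args) {
        System.out.println(readersCreated(1_000_000, 1_000_000));
        System.out.println(readersCreated(1_000_000, 100));
    }
}
```

In this model the number of instances is roughly the record count divided by the checkpoint interval, so a small interval turns even a modest input into thousands of re-creations.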
[1] https://lists.apache.org/thread.html/re6b0941a8b4951293a0327ce9b25e607cafd6e45b69783f65290edee%40%3Cdev.beam.apache.org%3E

On Fri, Dec 18, 2020 at 11:16 AM Ismaël Mejía <ieme...@gmail.com> wrote:
> Hello,
>
> I was trying to profile some pipeline using Java's direct runner. It
> reads ~30 60MB text files (CSV). When I started the profiler it
> reported more than 40K instances of TextSource being built which
> really surprised me given the small size of the data being processed.
> I wonder if I found maybe an issue of over-splitting after we moved to
> the SDF based translation that may affect simpler uses.
>
> I have not gone deeper or created a JIRA because I wanted to ask here
> first maybe to see if there is a 'valid' explanation for so many
> 'splits'.
>
> Regards,
> Ismaël