+1 to caching this, this seems a straightforward fix. It seems there should be no need to shell out to a subprocess to get this information, but python packaging is a mess...
On Mon, Jun 9, 2025 at 8:37 AM Kenneth Knowles <k...@apache.org> wrote: > > Perhaps the results of `pip freeze` can be cached to avoid the subprocess on > each unit test? and/or have a way to pass it in to the runner rather than > having the runner call the subprocess (so the cache could then be located in > Pipeline or TestPipeline)? Even though there's lifetime issues to deal with, > if the cache has a conservative key (even the current PID?) it should be > safe. This is definitely something that is owned by the SDK, because the SDK > produces the pipeline proto to be handed to the runner. > > Kenn > > On Thu, Jun 5, 2025 at 10:31 AM Danny McCormick via dev <dev@beam.apache.org> > wrote: >> >> I also don't feel strongly, I'm probably slightly opposed to changing the >> default though. To make the argument for keeping the current behavior: you >> do still get the staged file which is enough to do some debugging (though we >> would probably need to doc this better so people know about it; seems like >> we should find a way to do this regardless), it just doesn't get logged. >> That would have helped me in the past, but it is definitely less useful than >> the whole logging experience. For non-local runners, I also expect the >> penalty is comparatively less when amortized vs other associated costs of >> startup (but still significant). >> >> On Thu, Jun 5, 2025 at 10:18 AM Joey Tran <joey.t...@schrodinger.com> wrote: >>> >>> Would it be worth considering turning it off by default and making it >>> opt-in? I can imagine it's very useful, but a runner needs to implement >>> checking the generated requirements file, right? In which case, the default >>> behavior incurs a cost while not fully implementing the debugging behavior. >>> >>> Another example, the TrivialRunner[1] will run this pip freeze subprocess >>> on every pipeline run but I don't think it checks the generated file. >>> >>> (I don't feel strongly either way FWIW) >>> >>> [1] >>> https://github.com/apache/beam/blob/b7f2e1611556cf2dab7e9a901d3477023cd71294/sdks/python/apache_beam/runners/trivial_runner.py#L47 >>> >>> On Thu, Jun 5, 2025 at 9:59 AM Danny McCormick via dev >>> <dev@beam.apache.org> wrote: >>>> >>>> Thanks for calling this out. I generally agree with you. I've found this >>>> feature to be generally quite useful for production jobs running in >>>> distributed environments. I have seen several issues which have been >>>> solved because of it (and similarly I have seen issues which would have >>>> benefited from it before its introduction). At the same time, I agree it >>>> is not worth the cost when running locally since you're not at the same >>>> risk of diverging environments. >>>> >>>> I'd vote we disable this for Prism and can take that on as part of >>>> enabling prism as the default runner [1]. >>>> >>>> [1] WIP - https://github.com/apache/beam/pull/34612 >>>> >>>> On Wed, Jun 4, 2025 at 10:48 AM Joey Tran <joey.t...@schrodinger.com> >>>> wrote: >>>>> >>>>> Hey all, >>>>> >>>>> We recently upgraded to Beam 2.63 from 2.50. After the upgrade, our unit >>>>> tests testing our runner saw a 4x-5x performance hit. It turned out it >>>>> was because for every pipeline run, the default >>>>> `PipelineRunner.run_pipeline invokes >>>>> `PIpelineRunner.default_environment`[1] which eventually results in a >>>>> subprocess call to `pip freeze` to gather python requirements for later >>>>> logging [2]. >>>>> >>>>> This caught me by surprise and was very hard to debug since the massive >>>>> slowdown was due to using 2x more subprocesses than I specified for my >>>>> unit test runner, which resulted in my python processes thrashing. I've >>>>> turned off this logging feature for our runner, but just wanted to give a >>>>> heads up as any runner that uses the default `run_pipeline` method will >>>>> incur this cost. May be relevant to using the PrismRunner to replace the >>>>> python directrunner (or maybe y'all do want this check?) >>>>> >>>>> Cheers, >>>>> Joey >>>>> >>>>> [1] >>>>> https://github.com/apache/beam/blob/dd51c4cba108a0c425c37dfc28a81b3caf80d215/sdks/python/apache_beam/runners/runner.py#L182 >>>>> [2] >>>>> https://github.com/apache/beam/blob/dd51c4cba108a0c425c37dfc28a81b3caf80d215/sdks/python/apache_beam/runners/portability/stager.py#L906