On 1/30/2013 11:03 PM, Ehsan Akhgari wrote:
> (Follow-ups to dev-platform, please)
>
> Dear all,
>
> This email summarizes the results of our investigation on our options
> with regard to the future of PGO optimizations on Windows.  I will
> first describe the work that happened as part of the investigation,
> and will then propose a set of options on what solutions are available
> to us.  If you're interested in the tl;dr version, please scroll to
> the bottom. For the details, see the dependencies of bug 833881.
>
> (Note that we're only talking about PGO for libxul.  Anything outside
> of libxul, specifically the JS engine, is not going to be affected by
> the decision coming out of this thread.  And obviously, this
> discussion is only about Windows.)
>
> The first thing that we tried to investigate was whether or not
> upgrading to Visual Studio 2012 Update 1 makes the memory usage of the
> PGO linker drop down by a significant amount.  Thanks to the
> investigation done by jimm, we know that it will actually increase the
> memory usage, and therefore is not an option.
>
> Then, we tried to see how much breathing room we're going to have if
> we disable PGO but not link-time code generation (LTCG), and if we
> disable them both together.  It turns out that disabling PGO but
> keeping LTCG enabled reduces the memory usage by ~200MB, which means
> that it's not an effective measure.  Disabling both LTCG and PGO
> brings down the linker's virtual memory usage to around 1GB, which
> means that we will not hit the maximum virtual memory size of 4GB for
> a *long* time.  (Unfortunately, the Microsoft toolchain cannot perform
> PGO builds without LTCG.) Therefore, for the rest of this email, I
> will talk about disabling both PGO and LTCG.
>
> We then tried to get a sense of how much of a win the PGO
> optimizations are.  Thanks to a series of measurements by dmandelin,
> we know that disabling PGO/LTCG will result in a regression of about
> 10-20% on benchmarks which examine DOM and layout performance such as
> Dromaeo and guimark2 (and 40% in one case), but no significant
> regressions in startup time or Gmail interactions.  Thanks to a
> series of telemetry measurements performed by Vladan on a Nightly
> build we did last week which had PGO/LTCG disabled, there are no
> telemetry probes which show a significant regression on builds without
> PGO/LTCG.  Vladan is going to try to get this data out of a Tp5 run
> tomorrow as well, but we don't have any evidence to believe that the
> results of that experiment will be any different.
>
>
> Given the above, I'd like to propose the following long-term solutions:
>
> 1. Disable PGO/LTCG now.  The downside is that we would take a hit
> in microbenchmarks, specifically Dromaeo.  But we have no reason to
> believe that is going to affect any of the performance characteristics
> observed by our users.  And it means that engineers can stop worrying
> about this problem once and for all.
>
> 2. Try to delay disabling PGO/LTCG as much as possible.  Given the
> tracking implemented in bug 710840, we can now watch those graphs so
> that we know when this problem is going to hit next, and come up with
> a mitigation strategy.  In order to effectively implement this
> solution, we're going to need:
>   * A person to own watching the graphs and report back when we step
> inside "the danger zone" again.
>   * A detailed plan of action on what we'll do to mitigate this
> problem the next time, as opposed to reacting in a fire drill.  One
> possible plan of action could be disabling PGO for everything except
> content/dom/layout/xpcom/gfx, no questions asked.
>   * A group of engineers to own performing the above action.
>   * Going back through the historical data over the past year,
> determine the causes behind the large spikes in the gradual memory
> usage increase, and find solutions to them to buy as much time as
> possible.
>
> 3. Try to delay disabling PGO/LTCG until the next time that we hit the
> limit, and disable PGO/LTCG then once and for all.  In order to
> implement this solution, we're going to need:
>   * A person to own watching the graphs and report back when we step
> inside the danger zone again.
>   * A build-system patch which makes it possible to disable PGO/LTCG
> for libxul by toggling a switch.
>   * Clear documentation on what that switch is, so that anybody can
> toggle it when we need to take action the next time.
>
>
> I think given the information that we currently have, the best course
> of action is #3, followed by #1 and #2.  I'd like to explicitly
> recommend against #2, because I don't think we have the evidence to
> support that spending that much effort will bring any noticeable gains
> to our users.  This effort is better spent elsewhere.

After consideration, I think we ought to just bite the bullet and
disable PGO. We have no other way to fix this issue. All other work we
can do simply pushes it down the road. As our recent history has shown,
we simply don't have the ability to fix this in any long-term sense. If
Microsoft doesn't fix their toolchain, there's nothing we can do.

Related, I think we ought to seriously investigate funding work on
making clang a viable toolchain for building Firefox on Windows. Having
a non-open toolchain makes compiler bugs and limitations much more
painful, and this PGO issue is the extreme example.

-Ted

_______________________________________________
dev-platform mailing list
dev-platform@lists.mozilla.org
https://lists.mozilla.org/listinfo/dev-platform
