Re: [Distutils] reproducible builds
On 20 March 2017 at 23:34, Thomas Kluyver wrote:
> On Mon, Mar 20, 2017, at 01:02 PM, Robin Becker wrote:
> > I guess the algorithm variation across pythons would make dictionary
> > order quite variable.
>
> For a Python based tool, I think it's reasonable that reproducing a
> build requires running with the same version of Python.
>
> The requirement would be that, with enough information about the build
> environment, you *can* produce an identical PDF. It needn't (AFAIK) be
> identical every time anyone builds it.

Right, one of the other aspects of reproducible-builds is looking into ways to define and distribute build environments in addition to the application source code: https://reproducible-builds.org/docs/definition-strategies/

Within a given binary context (e.g. Debian packages), that may be a text description, like Debian's buildinfo files: https://wiki.debian.org/ReproducibleBuilds/BuildinfoFiles

For Fedora/RHEL/CentOS, the equivalent would probably be to extract a suitable config from the build system: https://fedoraproject.org/wiki/Using_the_Koji_build_system#Using_koji_to_generate_a_mock_config_to_replicate_a_buildroot

In other cases, the build environment may itself be a binary artifact (e.g. the manylinux1 container images, or the "Holy Build Box" machine images).

Fully eliminating non-determinism usually does require switching to explicit sorting and ordered containers in build tools and scripts, as otherwise even things like directory listings or JSON serialisation can introduce variations in output when a build is run on a different machine.
The reproducible-builds project offers some interesting tools to identify and analyse cases of non-reproducible outputs: https://reproducible-builds.org/tools/

However, nobody can reasonably expect arbitrary upstream projects (especially volunteer-run ones) to go out and pre-emptively solve that kind of problem. The most it's realistic to aim for is to encourage projects to be accommodating when changes are proposed upstream to introduce more determinism into their build processes, as well as into the artifact generation process for tools that may be used as part of the build process for other projects.

(And I agree with Thomas that it's likely the latter case that applies for reportlab-generated PDFs.)

Cheers,
Nick.

P.S. Prompted by Gary Bernhardt, one of the ways I've started thinking about the whole question of "built artifacts" in general is as a complex distributed caching problem, with reproducible builds being a way of ensuring that it's possible to check the validity of particular cache entries.

-- 
Nick Coghlan | ncogh...@gmail.com | Brisbane, Australia
___
Distutils-SIG maillist - Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig
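To make the point about explicit sorting concrete, here is a minimal sketch of the two cases mentioned above (directory listings and JSON serialisation). The helper names `stable_listing` and `stable_json` are made up for illustration; they are not part of any build tool discussed in this thread.

```python
import json
import os
import tempfile


def stable_listing(directory):
    """Return directory entries in sorted order.

    os.listdir() makes no promise about ordering, so sort explicitly.
    """
    return sorted(os.listdir(directory))


def stable_json(data):
    """Serialise with sorted keys and fixed separators.

    Equal data then gives equal text regardless of dict insertion order.
    """
    return json.dumps(data, sort_keys=True, separators=(",", ":"))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Create the files in a deliberately unsorted order.
        for name in ("b.txt", "a.txt", "c.txt"):
            open(os.path.join(tmp, name), "w").close()
        print(stable_listing(tmp))
    print(stable_json({"b": 1, "a": [3, 2]}))
```

Note that `sort_keys=True` only orders dictionary keys; lists are written in the order given, so any list built from an unordered source still needs sorting before it is serialised.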
Re: [Distutils] reproducible builds
On Mon, Mar 20, 2017, at 01:02 PM, Robin Becker wrote:
> Well now I am confused. The date / times mentioned in the debian patch
> are those we force into the documents produced by the reportlab package
> when it is used.
>
> They would not normally be part of the package itself. Although the
> reportlab documentation is available in the source I'm fairly sure we
> don't include it in the wheels.

I'm guessing, but I imagine that Debian may be using reportlab in the builds of other packages, to build documentation. It's normal for Debian packages to include built docs, unlike wheels. So they would want it to create PDFs reproducibly, but the PDFs generated in your test suite probably don't matter.

> I guess the algorithm variation across pythons would make dictionary order
> quite variable.

For a Python based tool, I think it's reasonable that reproducing a build requires running with the same version of Python.

The requirement would be that, with enough information about the build environment, you *can* produce an identical PDF. It needn't (AFAIK) be identical every time anyone builds it.

Thomas
Re: [Distutils] reproducible builds
On 20/03/2017 11:35, Thomas Kluyver wrote:
> On Mon, Mar 20, 2017, at 09:00 AM, Robin Becker wrote:
>> Obviously if I have the ability to embed repr(some_object) into the document output then it will vary (unless the underlying python is reproducible). I'm not sure if debian runs the whole reportlab test suite, but it makes sense to get this kind of variability out.
>
> AIUI, it's fine to have the *ability* to produce non-deterministic output, and it doesn't matter if your tests do that. The aim of reproducible builds is to be able to go from the same source code to an identical binary package. Documents generated by running the tests are presumably not included in binary packages, so it doesn't matter if they change.

Well now I am confused. The date / times mentioned in the debian patch are those we force into the documents produced by the reportlab package when it is used.

They would not normally be part of the package itself. Although the reportlab documentation is available in the source I'm fairly sure we don't include it in the wheels.

Of course if the debian packaging includes output created by reportlab then that document would receive the current (i.e. variable) time. In addition any random behaviour created by the reportlab generation code would also be embedded in the document.

If the debian variable is intended to create reproducible PDFs as part of their packaging of reportlab or some other package, then I'm fairly sure that other variation will need to be checked in addition to the control that the SOURCE_DATE_EPOCH variable would give. Perhaps Matthias could comment; I know little about how the debian packaging works.

>> I believe there was some way to modify the hashing introduced when the DoS dictionary attacks were an issue.
>
> The PYTHONHASHSEED environment variable: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED
>
> If you have non-determinism introduced by Python hashing, setting a constant value of PYTHONHASHSEED should be an easy way to work around it.

Well, years ago we tried to get some random behaviour in text selection by setting a seed value, e.g. 23..22 (but that doesn't work across pythons). I guess the algorithm variation across pythons would make dictionary order quite variable.

    C:\Users\rptlab>\python27\python
    Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:53:40) [MSC v.1500 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import random
    >>> random.seed(23..22)
    >>> from random import randint, choice
    >>> randint(10,25)
    15

    C:\Users\rptlab>\python36\python
    Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import random
    >>> random.seed(23..22)
    >>> from random import randint, choice
    >>> randint(10,25)
    21

-- 
Robin Becker
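For anyone wanting to see the effect of the PYTHONHASHSEED suggestion quoted above, here is a rough sketch. It assumes only the standard library; the child snippet and the function name `run_with_hash_seed` are invented for illustration. The variable has to be in the environment before the interpreter starts, which is why the check is run in a subprocess.

```python
import os
import subprocess
import sys

# Child program: prints the iteration order of a set of strings, which
# depends on string hashing and therefore on PYTHONHASHSEED.
CHILD = "print(list({'alpha', 'beta', 'gamma', 'delta', 'epsilon'}))"


def run_with_hash_seed(seed):
    """Run CHILD in a fresh interpreter with PYTHONHASHSEED set to seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", CHILD],
        env=env,
        stdout=subprocess.PIPE,
        universal_newlines=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # Same interpreter, same seed: the set order should repeat exactly.
    print(run_with_hash_seed(0) == run_with_hash_seed(0))
    # Different seeds may or may not give a different order.
    print(run_with_hash_seed(1))
    print(run_with_hash_seed(2))
```

This only pins the order for one interpreter build. As the sessions above show for `random`, nothing guarantees the same result across Python versions, so sorting before output remains the more robust fix.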
Re: [Distutils] reproducible builds
As Thomas mentioned, PYTHONHASHSEED is sufficient to solve non-determinism caused by hashing. In my experience hashing, along with datetimes (e.g. in the bytecode), is typically the only cause of non-determinism in Python packages.

Someone from, I think, Debian did mention [1] that they cannot always set PYTHONHASHSEED, and so in certain cases they apply patches to fix non-determinism. This is what they might be after in the case of `reportlab`, but you'd best ask them.

I'm not yet sure what to think of that patching approach. E.g., if one couldn't set PYTHONHASHSEED when building the bytecode in the interpreter itself, then one would have to convert all sets to lists, with potential negative performance effects.

On Mon, Mar 20, 2017 at 12:35 PM, Thomas Kluyver wrote:
> On Mon, Mar 20, 2017, at 09:00 AM, Robin Becker wrote:
> > Obviously if I have the ability to embed repr(some_object)
> > into the document output then it will vary (unless the underlying python
> > is reproducible). I'm not sure if debian runs the whole reportlab test
> > suite, but it makes sense to get this kind of variability out.
>
> AIUI, it's fine to have the *ability* to produce non-deterministic
> output, and it doesn't matter if your tests do that. The aim of
> reproducible builds is to be able to go from the same source code to an
> identical binary package. Documents generated by running the tests are
> presumably not included in binary packages, so it doesn't matter if they
> change.
>
> > I believe there was some way to modify the hashing introduced when the
> > DoS dictionary attacks were an issue.
>
> The PYTHONHASHSEED environment variable:
> https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED
>
> If you have non-determinism introduced by Python hashing, setting a
> constant value of PYTHONHASHSEED should be an easy way to work around
> it.
Re: [Distutils] reproducible builds
On Mon, Mar 20, 2017, at 09:00 AM, Robin Becker wrote:
> Obviously if I have the ability to embed repr(some_object)
> into the document output then it will vary (unless the underlying python
> is reproducible). I'm not sure if debian runs the whole reportlab test
> suite, but it makes sense to get this kind of variability out.

AIUI, it's fine to have the *ability* to produce non-deterministic output, and it doesn't matter if your tests do that. The aim of reproducible builds is to be able to go from the same source code to an identical binary package. Documents generated by running the tests are presumably not included in binary packages, so it doesn't matter if they change.

> I believe there was some way to modify the hashing introduced when the DoS
> dictionary attacks were an issue.

The PYTHONHASHSEED environment variable: https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED

If you have non-determinism introduced by Python hashing, setting a constant value of PYTHONHASHSEED should be an easy way to work around it.
Re: [Distutils] reproducible builds
On 18/03/2017 07:20, Nick Coghlan wrote:
> ...
> While the reproducible builds effort started in Debian and is furthest advanced there, it's not distro specific - interested developers working on other distros were already looking into it, and the Core Infrastructure Initiative has backed it as one of their security assurance initiatives.
>
> Software Freedom Conservancy have a decent write-up on the current state of things after December's Reproducible Builds Summit: https://sfconservancy.org/blog/2016/dec/26/reproducible-builds-summit-report/

Thanks for this; it seems the emphasis is on security. If the intent is that reportlab should be able to reliably reproduce the same binary output, then I think I need to do more than just fix a couple of dates. We use many dictionary-like objects to produce PDF and I am not sure all are sorted by key during output.

Is there a way to excite dictionary ordering changes? I believe there was some way to modify the hashing introduced when the DoS dictionary attacks were an issue. Would it be sufficient to generate documents with, say, Python 2.7 and check against 3.6?

> However, you'll probably want to make yourself a helper function that uses SOURCE_DATE_EPOCH if defined, and falls back to the current time otherwise. That way you'll get reproducible behaviour when a build system configures the setting, while retaining your current behaviour for environments that don't.

Good advice, and that's what I am doing.

> Cheers,
> Nick.
>
> P.S. A question well worth asking for *us* is whether or not setting SOURCE_DATE_EPOCH appropriately (if it isn't already set in the current environment) should be part of the build system abstraction PEPs.

-- 
Robin Becker
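The helper Nick describes could look roughly like the sketch below. This is an illustration only, not reportlab's actual code; the name `build_timestamp` and the optional `environ` argument (there to make it easy to test) are assumptions of mine. SOURCE_DATE_EPOCH is taken to be an integer count of seconds since the Unix epoch, in UTC.

```python
import os
import time
from datetime import datetime, timezone


def build_timestamp(environ=None):
    """Return the build time as a timezone-aware UTC datetime.

    Uses SOURCE_DATE_EPOCH (seconds since the Unix epoch) when it is set,
    and falls back to the current time otherwise.
    """
    if environ is None:
        environ = os.environ
    value = environ.get("SOURCE_DATE_EPOCH")
    # An unset or empty variable means "no override": use the clock.
    seconds = int(value) if value else time.time()
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    # With the variable set, the result is fixed by the environment.
    print(build_timestamp({"SOURCE_DATE_EPOCH": "1489968000"}).isoformat())
    # Without it, the current time is used, as before.
    print(build_timestamp({}).isoformat())
```

Returning UTC rather than local time matters here: formatting the same epoch value in the build machine's local timezone would reintroduce a machine-dependent difference.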
Re: [Distutils] reproducible builds
On 17/03/2017 17:49, David Wilson wrote:
> Hey Robin,
>
>> What happens if other distros decide not to use this environment variable? Do I really want distro specific code in the package?
>
> AFAIK this is seeing a great deal of use outside of Debian and even Linux, for instance GCC also supports this variable.
>
>> In short where does the distro responsibility and package maintainers boundary need to be?
>
> I guess it mostly comes down to whether you'd like them to carry the debt of a vendor patch to implement the behaviour for you in a way you don't like, or you'd prefer to retain full control. :) So it's more a preference than a responsibility.
>
> David

I think I accept the need to support this variable. Our original use case was for testing purposes, where we altered dates injected into the produced PDF metadata and also in some cases the content.

However, if that is the implied intent of the debian variable then I will also need to modify the behaviour of some other tests, e.g. in one case the produced PDF output looks like this:

    The value of i is not larger than 3
    The value of i is equal to 3
    The value of i is not less than 3
    The value of i is 3
    The value of i is 2
    The value of i is 1
    {'doc': , 'currentFrame': 'normal', 'currentPageTemplate': 'First', 'aW': 439.27559055118104, 'aH': 685.8897637795275, 'aWH': (439.27559055118104, 685.8897637795275), 'i': 0, 'availableWidth': 439.27559055118104, 'availableHeight': 619.8897637795275}
    The current page number is 1

i.e. we are introspecting internals and injecting that into the document content. I imagine I need to clean up the reporting to avoid getting addresses etc. into the documents. Obviously if I have the ability to embed repr(some_object) into the document output then it will vary (unless the underlying python is reproducible). I'm not sure if debian runs the whole reportlab test suite, but it makes sense to get this kind of variability out.

When we make significant changes to existing behaviours, our current workflow consists of generating a large number of outputs and then rendering them into JPEG pages with Ghostscript. Differences in the JPEGs can be used to spot problems.

-- 
Robin Becker
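One possible way to clean up that kind of introspection dump is sketched below. It is an assumption-laden illustration, not reportlab code: `stable_repr` is a made-up name, it presumes the keys are mutually sortable (e.g. all strings), and it presumes the varying part of each value's repr is a CPython-style " at 0x..." memory address.

```python
import re

# Default reprs such as "<Foo object at 0x7f3a...>" embed a memory address.
_ADDRESS = re.compile(r" at 0x[0-9A-Fa-f]+")


def stable_repr(namespace):
    """Format a mapping with sorted keys and memory addresses removed."""
    parts = []
    for key in sorted(namespace):
        # Strip the address so two runs give the same text.
        value = _ADDRESS.sub("", repr(namespace[key]))
        parts.append("%r: %s" % (key, value))
    return "{" + ", ".join(parts) + "}"


if __name__ == "__main__":
    class Doc(object):
        """Stand-in for an object that only has the default repr."""

    print(stable_repr({"i": 0, "doc": Doc(), "currentFrame": "normal"}))
```

Sorting the keys removes the dependence on dictionary ordering, and stripping addresses removes the per-run variation from default object reprs. Values whose repr varies for other reasons would still need their own handling.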