Re: [Distutils] reproducible builds

2017-03-20 Thread Nick Coghlan
On 20 March 2017 at 23:34, Thomas Kluyver  wrote:

> On Mon, Mar 20, 2017, at 01:02 PM, Robin Becker wrote:
> > I guess the algorithm variation across pythons would make dictionary
> > order quite variable.
>
> For a Python based tool, I think it's reasonable that reproducing a
> build requires running with the same version of Python.
>
> The requirement would be that, with enough information about the build
> environment, you *can* produce an identical PDF. It needn't (AFAIK) be
> identical every time anyone builds it.
>

Right, one of the other aspects of reproducible-builds is looking into ways
to define and distribute build environments in addition to the application
source code: https://reproducible-builds.org/docs/definition-strategies/

Within a given binary context (e.g. Debian packages), that may be a text
description, like Debian's buildinfo files:
https://wiki.debian.org/ReproducibleBuilds/BuildinfoFiles

For Fedora/RHEL/CentOS, the equivalent would probably be to extract a
suitable config from the build system:
https://fedoraproject.org/wiki/Using_the_Koji_build_system#Using_koji_to_generate_a_mock_config_to_replicate_a_buildroot

In other cases, the build environment may itself be a binary artifact (e.g.
the manylinux1 container images, or the "Holy Build Box" machine images).

Fully eliminating non-determinism usually does require switching to
explicit sorting and ordered containers in build tools and scripts, as
otherwise even things like directory listings or JSON serialisation can
introduce variations in output when a build is run on a different machine.
The reproducible-builds project offers some interesting tools to identify
and analyse cases of non-reproducible outputs:
https://reproducible-builds.org/tools/
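
A minimal sketch of the "explicit sorting" idea for the two cases named
above (directory listings and JSON serialisation). The helper names
sorted_file_listing and stable_json are hypothetical, chosen only for
illustration; they are not part of any build tool discussed in this thread.

```python
import json
import os


def sorted_file_listing(root):
    """Return relative file paths under root in sorted order.

    os.walk() yields entries in whatever order the filesystem returns them,
    so the result is sorted explicitly before use.
    """
    paths = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            rel = os.path.relpath(os.path.join(dirpath, name), root)
            paths.append(rel.replace(os.sep, "/"))
    return sorted(paths)


def stable_json(obj):
    """Serialise obj with sorted keys and fixed separators."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"))
```

With these, two machines that hold the same files and the same data should
emit the same manifest text, regardless of filesystem or dict ordering.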

However, nobody can reasonably expect arbitrary upstream projects
(especially volunteer run ones) to be going out and pre-emptively solving
that kind of problem - the most it's realistic to aim for is to encourage
projects to be accommodating when upstream changes are proposed to
introduce more determinism into the build processes for particular
projects, as well as into the artifact generation process for tools that
may be used as part of the build process for other projects. (And I agree
with Thomas that it's likely the latter case that applies for
reportlab-generated PDFs)

Cheers,
Nick.

P.S. Prompted by Gary Bernhardt, one of the ways I've started thinking
about the whole question of "built artifacts" in general is as a complex
distributed caching problem, with reproducible builds being a way of
ensuring that it's possible to check the validity of particular cache
entries.

-- 
Nick Coghlan   |   ncogh...@gmail.com   |   Brisbane, Australia
___
Distutils-SIG maillist  -  Distutils-SIG@python.org
https://mail.python.org/mailman/listinfo/distutils-sig


Re: [Distutils] reproducible builds

2017-03-20 Thread Thomas Kluyver
On Mon, Mar 20, 2017, at 01:02 PM, Robin Becker wrote:
> Well now I am confused. The date / times mentioned in the debian patch
> are those we force into the documents produced by the reportlab package
> when it is used.
>
> They would not normally be part of the package itself. Although the
> reportlab documentation is available in the source I'm fairly sure we
> don't include it in the wheels.

I'm guessing, but I imagine that Debian may be using reportlab in the
builds of other packages, to build documentation. It's normal for Debian
packages to include built docs, unlike wheels. So they would want it to
create PDFs reproducibly, but the PDFs generated in your test suite
probably don't matter.

> I guess the algorithm variation across pythons would make dictionary order 
> quite variable.

For a Python based tool, I think it's reasonable that reproducing a
build requires running with the same version of Python.

The requirement would be that, with enough information about the build
environment, you *can* produce an identical PDF. It needn't (AFAIK) be
identical every time anyone builds it.

Thomas


Re: [Distutils] reproducible builds

2017-03-20 Thread Robin Becker

On 20/03/2017 11:35, Thomas Kluyver wrote:
> On Mon, Mar 20, 2017, at 09:00 AM, Robin Becker wrote:
> > Obviously if I have the ability to embed repr(some_object)
> > into the document output then it will vary (unless the underlying python
> > is reproducible). I'm not sure if debian runs the whole reportlab test
> > suite, but it makes sense to get this kind of variability out.
>
> AIUI, it's fine to have the *ability* to produce non-deterministic
> output, and it doesn't matter if your tests do that. The aim of
> reproducible builds is to be able to go from the same source code to an
> identical binary package. Documents generated by running the tests are
> presumably not included in binary packages, so it doesn't matter if they
> change.

Well now I am confused. The date / times mentioned in the debian patch are those 
we force into the documents produced by the reportlab package when it is used.


They would not normally be part of the package itself. Although the reportlab 
documentation is available in the source I'm fairly sure we don't include it in 
the wheels.


Of course if the debian packaging includes output created by reportlab then that 
document would receive the current (ie variable) time. In addition any random 
behaviour created by the reportlab generation code would also be embedded in the 
document.


If the debian variable is intended to create reproducible PDFs as part of their 
packaging of reportlab or some other package then I'm fairly sure that other 
sources of variation will need to be checked in addition to the control that the 
SOURCE_DATE_EPOCH variable would give. Perhaps Matthias could comment; I know 
little about how the debian packaging works.



> > I believe there was some way to modify the hashing introduced when the dos
> > dictionary attacks were an issue.
>
> The PYTHONHASHSEED environment variable:
> https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED
>
> If you have non-determinism introduced by Python hashing, setting a
> constant value of PYTHONHASHSEED should be an easy way to work around
> it.



Well, years ago we tried to get some random behaviour in text selection by 
setting a seed value eg 23..22 (but that doesn't work across pythons). I 
guess the algorithm variation across pythons would make dictionary order quite 
variable.




C:\Users\rptlab>\python27\python
Python 2.7.13 (v2.7.13:a06454b1afa1, Dec 17 2016, 20:53:40) [MSC v.1500 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> random.seed(23..22)
>>> from random import randint, choice
>>> randint(10,25)
15

C:\Users\rptlab>\python36\python
Python 3.6.0 (v3.6.0:41df79263a11, Dec 23 2016, 08:06:12) [MSC v.1900 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import random
>>> random.seed(23..22)
>>> from random import randint, choice
>>> randint(10,25)
21
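
One possible way around the difference shown in the two sessions above, as a
sketch only: the random module's reproducibility notes promise that the
random() method keeps producing the same sequence for a given seed (when a
compatible seeder is used), while integer helpers such as randint() carry no
such promise across versions. The helper name stable_randint is
hypothetical, and the seed 12345 is an arbitrary illustration value, not the
one from the sessions.

```python
import random


def stable_randint(rng, low, high):
    """Pick an integer in [low, high] using only rng.random().

    rng.random() returns a float in [0.0, 1.0), so the scaled value is
    truncated to an offset between 0 and (high - low) inclusive.
    """
    return low + int(rng.random() * (high - low + 1))


if __name__ == "__main__":
    rng = random.Random(12345)  # integer seed, assumed for illustration
    print([stable_randint(rng, 10, 25) for _ in range(5)])
```

The distribution is not bit-for-bit the same as randint(), so this only
suits cases where a repeatable choice matters more than the exact algorithm.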




--
Robin Becker


Re: [Distutils] reproducible builds

2017-03-20 Thread Freddy Rietdijk
As Thomas mentioned, setting PYTHONHASHSEED is sufficient to remove the
non-determinism caused by hashing. In my experience, hashing and datetimes
(e.g. in the bytecode) are typically the only causes of non-determinism in
Python packages.

Someone from, I think, Debian did mention [1] that they cannot always set
PYTHONHASHSEED and so in certain cases they apply patches to fix
non-determinism. This is what they might be after in the case of
`reportlab`, but you had best ask them.

I'm not yet sure what to think of that patching approach. E.g., if one
couldn't set PYTHONHASHSEED when building the bytecode in the interpreter
itself, then one would have to convert all sets to lists with potential
negative performance effects.

On Mon, Mar 20, 2017 at 12:35 PM, Thomas Kluyver 
wrote:

> On Mon, Mar 20, 2017, at 09:00 AM, Robin Becker wrote:
> > Obviously if I have the ability to embed repr(some_object)
> > into the document output then it will vary (unless the underlying python
> > is reproducible). I'm not sure if debian runs the whole reportlab test
> > suite, but it makes sense to get this kind of variability out.
>
> AIUI, it's fine to have the *ability* to produce non-deterministic
> output, and it doesn't matter if your tests do that. The aim of
> reproducible builds is to be able to go from the same source code to an
> identical binary package. Documents generated by running the tests are
> presumably not included in binary packages, so it doesn't matter if they
> change.
>
> > I believe there was some way to modify the hashing introduced when the
> > dos dictionary attacks were an issue.
>
> The PYTHONHASHSEED environment variable:
> https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED
>
> If you have non-determinism introduced by Python hashing, setting a
> constant value of PYTHONHASHSEED should be an easy way to work around
> it.


Re: [Distutils] reproducible builds

2017-03-20 Thread Thomas Kluyver
On Mon, Mar 20, 2017, at 09:00 AM, Robin Becker wrote:
> Obviously if I have the ability to embed repr(some_object)
> into the document output then it will vary (unless the underlying python
> is reproducible). I'm not sure if debian runs the whole reportlab test
> suite, but it makes sense to get this kind of variability out.

AIUI, it's fine to have the *ability* to produce non-deterministic
output, and it doesn't matter if your tests do that. The aim of
reproducible builds is to be able to go from the same source code to an
identical binary package. Documents generated by running the tests are
presumably not included in binary packages, so it doesn't matter if they
change.

>  I believe there was some way to modify the hashing introduced when the dos 
> dictionary attacks were an issue. 

The PYTHONHASHSEED environment variable:
https://docs.python.org/3/using/cmdline.html#envvar-PYTHONHASHSEED

If you have non-determinism introduced by Python hashing, setting a
constant value of PYTHONHASHSEED should be an easy way to work around
it.
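
A small sketch of pinning the seed for a child process, assuming the build
step is launched from Python. The helper names pinned_hash_env and
hash_in_child are hypothetical and exist only for this illustration;
PYTHONHASHSEED itself is the documented interpreter variable linked above.

```python
import os
import subprocess
import sys


def pinned_hash_env(seed=0, base=None):
    """Return a copy of the environment with PYTHONHASHSEED fixed to seed."""
    env = dict(os.environ if base is None else base)
    env["PYTHONHASHSEED"] = str(seed)
    return env


def hash_in_child(text, seed):
    """Report hash(text) from a fresh interpreter started with the pinned seed."""
    code = "import sys; print(hash(sys.argv[1]))"
    out = subprocess.check_output(
        [sys.executable, "-c", code, text], env=pinned_hash_env(seed)
    )
    return int(out.decode().strip())


if __name__ == "__main__":
    # Two separate interpreter runs with the same seed agree on the hash.
    print(hash_in_child("reportlab", 0) == hash_in_child("reportlab", 0))
```

The variable has to be set before the interpreter starts, which is why the
sketch passes it through the child's environment rather than setting it in
the running process.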


Re: [Distutils] reproducible builds

2017-03-20 Thread Robin Becker

On 18/03/2017 07:20, Nick Coghlan wrote:
...
> While the reproducible builds effort started in Debian and is furthest
> advanced there, it's not distro specific - interested developers working on
> other distros were already looking into it, and the Core Infrastructure
> Initiative has backed it as one of their security assurance initiatives.
> Software Freedom Conservancy have a decent write-up on the current state of
> things after December's Reproducible Builds Summit:
> https://sfconservancy.org/blog/2016/dec/26/reproducible-builds-summit-report/

Thanks for this; it seems the emphasis is on security. If the intent is that 
reportlab should be able to reliably reproduce the same binary output then I 
think I need to do more than just fix a couple of dates. We use many 
dictionary-like objects to produce PDF and I am not sure all are sorted by key 
during output.

Is there a way to provoke dictionary ordering changes? I believe there was some 
way to modify the hashing introduced when the DoS dictionary attacks were an 
issue. Would it be sufficient to generate documents with say Python 2.7 and 
check against 3.6?
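
One way to provoke ordering changes within a single Python version, sketched
under the assumption that the output step can be run as a child process:
run it under several PYTHONHASHSEED values and count the distinct outputs.
The snippet strings and the helper distinct_outputs are hypothetical
stand-ins, not reportlab code. A set is used because its iteration order
follows string hashing; on Python 3.6 the built-in dict preserves insertion
order, so dicts alone may not show the effect there.

```python
import os
import subprocess
import sys

# Stand-ins for an output step that iterates over a hashed container.
UNSORTED = "print(','.join({'Font', 'Page', 'Catalog', 'Outlines', 'Info'}))"
SORTED = "print(','.join(sorted({'Font', 'Page', 'Catalog', 'Outlines', 'Info'})))"


def distinct_outputs(snippet, seeds):
    """Run snippet once per PYTHONHASHSEED value; return the set of outputs seen."""
    seen = set()
    for seed in seeds:
        env = dict(os.environ, PYTHONHASHSEED=str(seed))
        out = subprocess.check_output([sys.executable, "-c", snippet], env=env)
        seen.add(out)
    return seen


if __name__ == "__main__":
    seeds = range(1, 9)
    print("unsorted variants:", len(distinct_outputs(UNSORTED, seeds)))
    print("sorted variants:", len(distinct_outputs(SORTED, seeds)))
```

More than one variant for a given step suggests it depends on hash order;
sorting keys at output time should bring the count back to one.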




> However, you'll probably want to make yourself a helper function that uses
> SOURCE_DATE_EPOCH if defined, and falls back to the current time otherwise.
> That way you'll get reproducible behaviour when a build system configures
> the setting, while retaining your current behaviour for environments that
> don't.


Good advice, and that's what I am doing.
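
A minimal sketch of such a helper, assuming SOURCE_DATE_EPOCH holds an
integer number of seconds since the Unix epoch, as the reproducible-builds
convention describes. The function names build_timestamp and
format_utc are hypothetical and are not taken from reportlab.

```python
import os
import time


def build_timestamp(environ=None):
    """Return the timestamp to stamp into generated output.

    Uses SOURCE_DATE_EPOCH when the build system has set it, and falls
    back to the current time otherwise.
    """
    environ = os.environ if environ is None else environ
    value = environ.get("SOURCE_DATE_EPOCH")
    if value:
        return int(value)
    return int(time.time())


def format_utc(timestamp):
    """Format a timestamp in UTC so the result does not depend on local timezone."""
    return time.strftime("%Y-%m-%d %H:%M:%S", time.gmtime(timestamp))


if __name__ == "__main__":
    print(format_utc(build_timestamp()))
```

Formatting in UTC matters as much as pinning the value: a fixed timestamp
rendered in local time can still differ between build machines.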




> Cheers,
> Nick.
>
> P.S. A question well worth asking for *us* is whether or not setting
> SOURCE_DATE_EPOCH appropriately (if it isn't already set in the current
> environment) should be part of the build system abstraction PEPs.




--
Robin Becker


Re: [Distutils] reproducible builds

2017-03-20 Thread Robin Becker

On 17/03/2017 17:49, David Wilson wrote:
> Hey Robin,
>
> > What happens if other distros decide not to use this environment variable?
> > Do I really want distro specific code in the package?
>
> AFAIK this is seeing a great deal of use outside of Debian and even
> Linux, for instance GCC also supports this variable.
>
> > In short where does the distro responsibility and package maintainers
> > boundary need to be?
>
> I guess it mostly comes down to whether you'd like them to carry the
> debt of a vendor patch to implement the behaviour for you in a way you
> don't like, or you'd prefer to retain full control. :)  So it's more a
> preference than a responsibility.
>
> David

I think I accept the need to support this variable. Our original use case was 
for testing purposes where we altered dates injected into the produced PDF 
metadata and also in some cases the content.


However, if that is the implied intent of the debian variable then I will also 
need to modify the behaviour of some other tests eg in one case the produced PDF 
output looks like this




The value of i is not larger than 3
The value of i is equal to 3
The value of i is not less than 3
The value of i is 3
The value of i is 2
The value of i is 1
{'doc': , 'currentFrame': 'normal', 'currentPageTemplate': 'First', 
'aW':
439.27559055118104, 'aH': 685.8897637795275, 'aWH': (439.27559055118104,
685.8897637795275), 'i': 0, 'availableWidth': 439.27559055118104, 
'availableHeight':
619.8897637795275}
The current page number is 1


ie we are introspecting internals and injecting that into the document content. 
I imagine I need to clean up the reporting to avoid getting addresses etc. 
into the documents. Obviously if I have the ability to embed repr(some_object) 
into the document output then it will vary (unless the underlying python is 
reproducible). I'm not sure if debian runs the whole reportlab test suite, but 
it makes sense to get this kind of variability out.
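
A sketch of one way to keep addresses out of such reporting: replace the
default repr of arbitrary objects with just the type name, and emit dict
items sorted by key. The function stable_repr and the Frame class are
hypothetical illustrations, not reportlab API.

```python
def stable_repr(obj):
    """Return a repr-like string with no memory addresses and sorted dict keys."""
    if isinstance(obj, dict):
        items = sorted(obj.items(), key=lambda kv: repr(kv[0]))
        body = ", ".join(f"{stable_repr(k)}: {stable_repr(v)}" for k, v in items)
        return "{" + body + "}"
    if isinstance(obj, (list, tuple)):
        open_, close = ("[", "]") if isinstance(obj, list) else ("(", ")")
        return open_ + ", ".join(stable_repr(v) for v in obj) + close
    if isinstance(obj, (str, bytes, int, float, bool, type(None))):
        return repr(obj)
    # Anything else: the default repr would include an address such as
    # "at 0x...", so only the type name is reported.
    return f"<{type(obj).__name__} object>"


class Frame:
    """Hypothetical stand-in for an internal object."""


if __name__ == "__main__":
    print(stable_repr({"i": 0, "doc": Frame(), "aWH": (1.5, 2.5)}))
```

This is deliberately lossy; it trades detail in the diagnostic text for
output that should not change from run to run.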


When we make significant changes to existing behaviours our current workflow 
consists of generating a large number of outputs and then rendering them into 
JPEG pages with Ghostscript. Differences in the JPEGs can be used to spot problems.

--
Robin Becker