Hi folks,

This message isn't _directly_ related to reproducible builds, but it does relate to unexpected differences in text (including, potentially, source code) checked out from git repositories, and I think that could be relevant to the audience here.

Some of the code within the Sphinx documentation generator removes carriage-return ('\r' in Python string literal escape code notation) characters from input documents before checksumming them, and that part of the code puzzled me - generally any kind of content modification before checksumming seems like a code smell to me.

The relevant code removes those carriage-returns so that the checksums produced are, in a sense, cross-platform compatible; that is, the 'same content' produces the same checksum whether the platform uses CRLF or LF line-endings.

Now, Python itself does include some functionality[1] to handle what it refers to as 'universal newlines'; newlines in strings are generally represented using a single '\n' character, which is serialized to and deserialized from CRLF or LF as appropriate for the platform when text files are written and read. This is stable, mature and well-established behaviour at this point.

That universal newline handling may cause problems in some cases if not handled carefully, but surprisingly -- at least to me -- 'git' itself can also automatically convert the line-endings of files to the local platform's standard, depending on its configuration. I suppose this makes sense so that developer tooling designed for each platform works as expected with text stored in git repositories (which, internally, store the newlines using LF). However, it does mean that the checksums of files checked out from the same origin git repository can differ on different OS platforms.

Overriding this behaviour on a per-file basis is possible using .gitattributes config[2] file(s) within the repository, or alternatively a git client system can use the 'core.autocrlf' configuration setting[3] to specify the desired line-ending-conversion method.

Again: this is probably slightly off-topic and perhaps not of direct relevance to anyone on the list today.
However, it seems like the kind of issue that is useful to be aware of if-and-when puzzling over unexpected git content / checksum issues (situations that I _do_ expect people on this list encounter from time to time).

Regards,
James

[1] - https://docs.python.org/3.12/glossary.html#term-universal-newlines
[2] - https://git-scm.com/docs/gitattributes
[3] - https://git-scm.com/docs/git-config#Documentation/git-config.txt-coresafecrlf

PS: For anyone concerned that this might inadvertently expose some kind of checksumming vulnerability: I briefly worried about that after determining the line-ending behaviour to be the cause. Padding of source files with carriage-returns could be a way for bad actors to attempt to find checksum collisions, yes; but equally, newlines -- or spaces -- are available to achieve the same. Are there any languages that attempt to prevent arbitrary source code padding so that checksum-space-exploration from a known code plaintext is constrained? Golang and other languages that require or support autoformatting may be the safest bets.
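
PPS: As a rough illustration of the line-ending effect described above, here is a minimal Python sketch. It is not the actual Sphinx code; the function names are made up for the example, and SHA-256 is an arbitrary choice of checksum. It shows that the 'same content' with LF and CRLF line-endings hashes differently, and that stripping carriage-returns before hashing makes the two checksums agree.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Checksum the bytes exactly as given (hypothetical helper)."""
    return hashlib.sha256(data).hexdigest()


def sha256_without_cr(data: bytes) -> str:
    """Remove every carriage-return byte, then checksum (hypothetical helper)."""
    return sha256_hex(data.replace(b"\r", b""))


# The same two lines of text, as they might appear on an LF platform
# and on a CRLF platform after a checkout that converts line-endings.
lf_text = b"line one\nline two\n"
crlf_text = lf_text.replace(b"\n", b"\r\n")

print(sha256_hex(lf_text) == sha256_hex(crlf_text))                # False
print(sha256_without_cr(lf_text) == sha256_without_cr(crlf_text))  # True
```

Note that the files are handled as bytes here; opening them in Python's text mode would apply universal newline translation before the checksum code ever saw the content.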