Announcing summershum
Good morning everyone,

As you know, Ralph and I went to DevConf last weekend, and of course, what happens when you put two hackers in the same room? Well, they go crazy and start hacking... The result of this is summershum.

The idea originates from a discussion between Mickael Scherrer, Ralph and me on Friday evening: could we track all the files in every package in the distribution?

Ideally, this would allow us to investigate questions like:
- How many copies of the GPL license are shipped?
- How many copies of the GPL license still carry the old FSF address?
- How many copies of jquery or md5.c?
- How many files changed between two releases?

So Ralph and I wrote summershum. It is a simple database storing, for each file in each package:
- the package name
- the filename
- the sha1sum of the file
- the tarball name
- the md5sum of the tarball
- a creation date

Next to the database is a fedmsg consumer that, for each new upload to the lookaside cache, downloads the new tarball, extracts it and fills the database with the sha1sum of every file found.

There is an RFE open on the project to store the same information for the binary RPMs themselves. This would run for each successful build on koji.

The project is currently at: https://github.com/ralphbean/summershum
It comes with a summershum-cli, which uses datagrepper to retrieve the recent uploads to the lookaside cache and loads them into the database.

I think the current state is good enough to start deploying it, but we wanted to announce and discuss it before taking any further action.

So, what do you think?

Cheers,
Pierre

___
infrastructure mailing list
infrastructure@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/infrastructure
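To make the announcement above a little more concrete, here is a minimal Python sketch of the per-tarball step it describes: open a tarball, compute the sha1sum of every regular file in it, and store one row per file. This is not summershum's actual code. The table layout, the column names and the helpers `make_db`, `index_tarball` and `count_copies` are assumptions made for illustration, and an in-memory SQLite database stands in for whatever backend the project really uses.

```python
import hashlib
import io
import sqlite3
import tarfile


def make_db():
    """Create an in-memory database with a hypothetical 'files' table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE files (package TEXT, filename TEXT, sha1sum TEXT, "
        "tarball TEXT, tarball_md5 TEXT, created TEXT)"
    )
    return conn


def index_tarball(conn, package, tarball_name, tarball_bytes):
    """Store one row per regular file found in the tarball."""
    tarball_md5 = hashlib.md5(tarball_bytes).hexdigest()
    with tarfile.open(fileobj=io.BytesIO(tarball_bytes)) as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directories, symlinks, device nodes
            data = tar.extractfile(member).read()
            conn.execute(
                "INSERT INTO files VALUES (?, ?, ?, ?, ?, datetime('now'))",
                (package, member.name, hashlib.sha1(data).hexdigest(),
                 tarball_name, tarball_md5),
            )
    conn.commit()


def count_copies(conn, sha1sum):
    """Answer 'how many copies of this exact file do we ship?'."""
    row = conn.execute(
        "SELECT COUNT(*) FROM files WHERE sha1sum = ?", (sha1sum,)
    ).fetchone()
    return row[0]
```

With rows like these, a question such as "how many copies of the GPL license are shipped?" becomes a single lookup by the sha1sum of the license text. Only byte-identical copies are counted, so a license file carrying the old FSF address would show up under a different hash.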
Re: Announcing summershum
On 12.2.2014 09:46, Pierre-Yves Chibon wrote:
> Good morning everyone,
>
> As you know Ralph and I went to DevConf last week-end, and of course, what
> happens when you put two hackers in the same room? Well they go crazy and
> start hacking... The result of this is summershum.
>
> The idea originates from a discussion between Mickael Scherrer, Ralph and I
> on Friday evening. Could we track all the files in every packages in the
> distribution?
>
> Ideally, this would allow us to investigate questions like:
> - How many copies of the GPL license are shipped?
> - How many GPL license still ship the old FSF address?
> - How many copies of jquery or md5.c?
> - How many files changed between two releases?
>
> So Ralph and I wrote summershum, it's a simple database storing for each
> file in each package:
> - the packages name
> - the filename
> - the sha1sum of the file
> - the tarball name
> - the md5sum of the tarball

I don't think we should use md5sum. It is disabled by default in recent OpenSSL if I am not mistaken.

Vít

> - a creation date
>
> Next to the database is a fedmsg consumer that for each new upload on the
> lookaside cache, download the new tarball, extracts it and fills the
> database with the sha1sum of every file found.
>
> There is a RFE opened on the project to store the same information for the
> binary/rpm themselves. This would work for each successful build on koji.
>
> The project is currently at: https://github.com/ralphbean/summershum
> It comes with a summershum-cli which fills the database using datagrepper
> to retrieve the recent uploads to the lookaside cache and load them in the
> database.
>
> I think the current state is good enough to start deploying it but we
> wanted to announce/discuss about it before taking any further action.
>
> So, what do you think?
Re: Announcing summershum
On Wed, Feb 12, 2014 at 11:58:15AM +0100, Vít Ondruch wrote:
> On 12.2.2014 09:46, Pierre-Yves Chibon wrote:
> > So Ralph and I wrote summershum, it's a simple database storing for each
> > file in each package:
> > - the packages name
> > - the filename
> > - the sha1sum of the file
> > - the tarball name
> > - the md5sum of the tarball
>
> I don't think we should use md5sum. It is disabled by default in recent
> OpenSSL if I am not mistaken.

That's what we use in the lookaside cache (the `sources` file in your git repository), so in fact I just get this information from the fedmsg message. But that's one reason why we use sha1sum for the files.

Pierre
Re: Announcing summershum
On Wed, Feb 12, 2014 at 09:46:27AM +0100, Pierre-Yves Chibon wrote:
> So Ralph and I wrote summershum, it's a simple database storing for each
> file in each package:
> - the packages name
> - the filename
> - the sha1sum of the file
> - the tarball name
> - the md5sum of the tarball
> - a creation date

Neat!

So, I have one small suggestion and one possibly-too-ambitious one.

The small one is that it might be nice to include the output of `file` on each file.

The large one is: while we're going through all of the files, could we index them for full-text searching? Possibly using https://github.com/Debian/dcs, or (in my dreams) something that returns the results broken down by current repo and with options for a summary list by package and file (csv, say).

--
Matthew Miller
Fedora Project
md5 vs sha256 in dist-git sources
On 12.2.2014 12:15, Pierre-Yves Chibon wrote:
> On Wed, Feb 12, 2014 at 11:58:15AM +0100, Vít Ondruch wrote:
> > On 12.2.2014 09:46, Pierre-Yves Chibon wrote:
> > > So Ralph and I wrote summershum, it's a simple database storing for
> > > each file in each package:
> > > - the packages name
> > > - the filename
> > > - the sha1sum of the file
> > > - the tarball name
> > > - the md5sum of the tarball
> >
> > I don't think we should use md5sum. It is disabled by default in recent
> > OpenSSL if I am not mistaken.
>
> That's what we use in the lookaside cache (the source file in your git)

Interesting, since the review guidelines [1] say this:

    MUST: The sources used to build the package must match the upstream
    source, as provided in the spec URL. Reviewers should use sha256sum
    for this task as it is used by the sources file once imported into
    git.

But checking some of my packages, you are right that the "sources" file has md5 hashes. Maybe somebody could look into this as well.

Vít

[1] http://fedoraproject.org/wiki/Packaging:ReviewGuidelines
Re: Announcing summershum
On 12.2.2014 12:44, Matthew Miller wrote:
> On Wed, Feb 12, 2014 at 09:46:27AM +0100, Pierre-Yves Chibon wrote:
> > So Ralph and I wrote summershum, it's a simple database storing for each
> > file in each package:
> > - the packages name
> > - the filename
> > - the sha1sum of the file
> > - the tarball name
> > - the md5sum of the tarball
> > - a creation date
>
> Neat!
>
> So, I have one small suggestion and one possibly-too ambitious one.
>
> The small one is that it might be nice to include the output of `file` on
> each file.

The `file` output would need to be stored together with the `file` version, since its output is not stable in my experience.

Vít

> The large one is: while we're going through all of the files, could we
> index them for full-text searching? Possibly using
> https://github.com/Debian/dcs, or (in my dreams) something that returns
> the results broken down by current repo and with options for a summary
> list by package and file (csv, say).
Re: Announcing summershum
On Wed, Feb 12, 2014 at 01:47:08PM +0100, Vít Ondruch wrote:
> On 12.2.2014 12:44, Matthew Miller wrote:
> > On Wed, Feb 12, 2014 at 09:46:27AM +0100, Pierre-Yves Chibon wrote:
> > > So Ralph and I wrote summershum, it's a simple database storing for
> > > each file in each package:
> > > - the packages name
> > > - the filename
> > > - the sha1sum of the file
> > > - the tarball name
> > > - the md5sum of the tarball
> > > - a creation date
> >
> > Neat!
> >
> > So, I have one small suggestion and one possibly-too ambitious one.
> >
> > The small one is that it might be nice to include the output of `file`
> > on each file.

Technically doable, but I'm curious: what use case do you have in mind?

Plus I've run into:

    $ file /usr/share/doc/python-magic-5.04/example.py -b
    ASCII Java program text

Seems legit, right? :)

> The `file` output would need to be store with `file` version, since its
> output is not stable in my experience.

That's not a problem: we store the tarball name, which contains the version.

Pierre
Re: Announcing summershum
On Wed, Feb 12, 2014 at 01:47:08PM +0100, Vít Ondruch wrote:
> > The small one is that it might be nice to include the output of `file`
> > on each file.
> The `file` output would need to be store with `file` version, since its
> output is not stable in my experience.

Yeah, that's kind of unfortunate. The man page says:

    The type printed will usually contain one of the words _text_ (the
    file contains only printing characters and a few common control
    characters and is probably safe to read on an ASCII terminal),
    _executable_ (the file contains the result of compiling a program in
    a form understandable to some UNIX kernel or another), or _data_
    meaning anything else (data is usually “binary” or non-printable).
    Exceptions are well-known file formats (core files, tar archives)
    that are known to contain binary data. When modifying magic files or
    the program itself, make sure to preserve these keywords. Users
    depend on knowing that all the readable files in a directory have the
    word “text” printed. Don't do as Berkeley did and change “shell
    commands text” to “shell script”.

I'm not sure how stable the "well-known file formats" list is. Maybe we could standardize on whatever version is in RHEL7 and keep that until RHEL7 EOL? That should give us a nice long timeframe.

Alternatively, we could accept that the strings might change slightly and not worry too much.

We might need to special-case some things, too -- by default, it prints the target of symbolic links, for example.

--
Matthew Miller
Fedora Project
Re: Announcing summershum
On Wed, Feb 12, 2014 at 02:31:28PM +0100, Pierre-Yves Chibon wrote:
> > > The small one is that it might be nice to include the output of `file`
> > > on each file.
> Technically doable but I'm curious, what use-case do you have in mind?

Looking for binaries, blobs, and archives that have crept into the source.

Also, pretty charts. :)

> Plus I've ran into:
> $ file /usr/share/doc/python-magic-5.04/example.py -b
> ASCII Java program text
> Seems legit, right? :)

Comes out "Python script, ASCII text executable" on Rawhide.

--
Matthew Miller
Fedora Project
Re: Announcing summershum
On Wed, Feb 12, 2014 at 08:43:18AM -0500, Matthew Miller wrote:
> > Technically doable but I'm curious, what use-case do you have in mind?
> Looking for binaries, blobs, and archives that have crept in to the source.
> Also, pretty charts. :)

Also, if combined with the full-text search, one could say "search in C and C++ source files" when looking for a known vulnerability.

--
Matthew Miller
Fedora Project
Re: Announcing summershum
On Wed, Feb 12, 2014 at 08:43:18AM -0500, Matthew Miller wrote:
> On Wed, Feb 12, 2014 at 02:31:28PM +0100, Pierre-Yves Chibon wrote:
> > > > The small one is that it might be nice to include the output of
> > > > `file` on each file.
> > Technically doable but I'm curious, what use-case do you have in mind?
>
> Looking for binaries, blobs, and archives that have crept in to the source.

Cool, good idea :)

> Also, pretty charts. :)

Dang, you found my one weakness! Arg!!

> > Plus I've ran into:
> > $ file /usr/share/doc/python-magic-5.04/example.py -b
> > ASCII Java program text
> > Seems legit, right? :)
>
> Comes out "Python script, ASCII text executable" on Rawhide

Indeed, that was on an EL6 machine. I was checking whether that library was available on EL6 (the OS of our infra).

Pierre
Re: md5 vs sha256 in dist-git sources
On Wed, 2014-02-12 at 13:44 +0100, Vít Ondruch wrote:
> On 12.2.2014 12:15, Pierre-Yves Chibon wrote:
> > On Wed, Feb 12, 2014 at 11:58:15AM +0100, Vít Ondruch wrote:
> > > On 12.2.2014 09:46, Pierre-Yves Chibon wrote:
> > > > So Ralph and I wrote summershum, it's a simple database storing for
> > > > each file in each package:
> > > > - the packages name
> > > > - the filename
> > > > - the sha1sum of the file
> > > > - the tarball name
> > > > - the md5sum of the tarball
> > >
> > > I don't think we should use md5sum. It is disabled by default in
> > > recent OpenSSL if I am not mistaken.
> >
> > That's what we use in the lookaside cache (the source file in your git)
>
> Interesting, since review guidelines [1] says this:
>
>     MUST: The sources used to build the package must match the upstream
>     source, as provided in the spec URL. Reviewers should use sha256sum
>     for this task as it is used by the sources file once imported into
>     git.
>
> But checking some of my packages, you are right that the "sources"
> file has md5 has. May be somebody could look into this as well.

AFAIK, the hashing mechanism to use is defined in the fedpkg configuration file:

https://git.fedorahosted.org/cgit/fedpkg.git/tree/src/fedpkg.conf

So theoretically, you could change it locally, and the sources you upload would then have their sha256sum in the `sources` file.

But then, people who download them with `fedpkg sources` (that includes the Koji builders) would receive error messages saying that the checksum does not match.

So we would probably need to add a fallback mechanism in pyrpkg, so that if sha256 verification fails, it then tries md5.

--
Mathieu
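The fallback idea in the message above can be sketched in a few lines of Python. This is only an illustration of the approach, not pyrpkg code: the function name `verify_source` and its signature are hypothetical, and the real tool would have to read the expected value from the `sources` file and the data from the downloaded tarball.

```python
import hashlib


def verify_source(data, expected, algorithms=("sha256", "md5")):
    """Return the name of the first algorithm whose digest matches.

    'data' is the file content as bytes, 'expected' is the hex digest
    taken from the sources file. Algorithms are tried in order, so the
    stronger hash is preferred and md5 is only a fallback. Returns None
    when nothing matches.
    """
    expected = expected.strip().lower()
    for name in algorithms:
        if hashlib.new(name, data).hexdigest() == expected:
            return name
    return None
```

A caller would treat a return value of `None` as a checksum mismatch. Returning the algorithm name rather than a plain boolean lets the caller log when the md5 fallback was needed, which would help track how many `sources` files still carry the old hash.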
Re: md5 vs sha256 in dist-git sources
On 12.2.2014 15:29, Mathieu Bridon wrote:
> On Wed, 2014-02-12 at 13:44 +0100, Vít Ondruch wrote:
> > On 12.2.2014 12:15, Pierre-Yves Chibon wrote:
> > > On Wed, Feb 12, 2014 at 11:58:15AM +0100, Vít Ondruch wrote:
> > > > On 12.2.2014 09:46, Pierre-Yves Chibon wrote:
> > > > > So Ralph and I wrote summershum, it's a simple database storing
> > > > > for each file in each package:
> > > > > - the packages name
> > > > > - the filename
> > > > > - the sha1sum of the file
> > > > > - the tarball name
> > > > > - the md5sum of the tarball
> > > >
> > > > I don't think we should use md5sum. It is disabled by default in
> > > > recent OpenSSL if I am not mistaken.
> > >
> > > That's what we use in the lookaside cache (the source file in your git)
> >
> > Interesting, since review guidelines [1] says this:
> >
> >     MUST: The sources used to build the package must match the upstream
> >     source, as provided in the spec URL. Reviewers should use sha256sum
> >     for this task as it is used by the sources file once imported into
> >     git.
> >
> > But checking some of my packages, you are right that the "sources"
> > file has md5 has. May be somebody could look into this as well.
>
> Afaik, the hashing mechanism to use is defined in the fedpkg
> configuration file:
>
> https://git.fedorahosted.org/cgit/fedpkg.git/tree/src/fedpkg.conf
>
> So theoretically, you could change it locally, and the sources you
> upload would then have their sha256sum in the `sources` file.
>
> But then, people who would download them with `fedpkg sources` (that
> includes Koji builders) would receive error messages that the checksum
> does not match.
>
> So we would probably need to add a fallback mechanism in pyrpkg, so that
> if sha256 verification fails, then it would try md5.

That looks sub-optimal, so to speak :)

Vít
Re: Announcing summershum
On 12.2.2014 14:31, Pierre-Yves Chibon wrote:
> On Wed, Feb 12, 2014 at 01:47:08PM +0100, Vít Ondruch wrote:
> > On 12.2.2014 12:44, Matthew Miller wrote:
> > > On Wed, Feb 12, 2014 at 09:46:27AM +0100, Pierre-Yves Chibon wrote:
> > > > So Ralph and I wrote summershum, it's a simple database storing for
> > > > each file in each package:
> > > > - the packages name
> > > > - the filename
> > > > - the sha1sum of the file
> > > > - the tarball name
> > > > - the md5sum of the tarball
> > > > - a creation date
> > >
> > > Neat!
> > >
> > > So, I have one small suggestion and one possibly-too ambitious one.
> > >
> > > The small one is that it might be nice to include the output of `file`
> > > on each file.
>
> Technically doable but I'm curious, what use-case do you have in mind?
>
> Plus I've ran into:
> $ file /usr/share/doc/python-magic-5.04/example.py -b
> ASCII Java program text
>
> Seems legit, right? :)
>
> > The `file` output would need to be store with `file` version, since its
> > output is not stable in my experience.
>
> That's not a problem, we stored the tarball name which does contain the
> version

I was referring to the version of the `file` utility itself. I know that `file` gives different output for HTML files: sometimes they are classified as html, other times as xhtml or xml. For Ruby files, it returns something like "ruby file" or "ruby module", but the heuristic is unreliable.

See Matt's email; he parsed my objection correctly ;)

Vít
Re: Announcing summershum
On Wed, Feb 12, 2014 at 12:46 AM, Pierre-Yves Chibon wrote:
> The idea originates from a discussion between Mickael Scherrer, Ralph and I
> on Friday evening. Could we track all the files in every packages in the
> distribution?
>
> Ideally, this would allow us to investigate questions like:
> - How many copies of the GPL license are shipped?
> - How many GPL license still ship the old FSF address?
> - How many copies of jquery or md5.c?
> - How many files changed between two releases?

Cool idea, and it sounds a lot like what FOSSology could do for you already (http://www.fossology.org/projects/fossology). Have you checked that out?

-Jeff
Re: md5 vs sha256 in dist-git sources
There's a releng ticket (formerly infra) about moving away from md5 for this.

https://fedorahosted.org/rel-eng/ticket/5846

I'm sure any offers of help would be welcome. :)

kevin
Re: md5 vs sha256 in dist-git sources
On Wed, 2014-02-12 at 08:22 -0700, Kevin Fenzi wrote:
> There's a releng ticket (formerly infra) about moving away from md5 for
> this.
>
> https://fedorahosted.org/rel-eng/ticket/5846
>
> I'm sure any offers of help would be welcome. :)

Done.

--
Mathieu
Re: Announcing summershum
Hi,

this sounds like a good idea.

On Wed, Feb 12, 2014 at 09:46:27AM +0100, Pierre-Yves Chibon wrote:
> So Ralph and I wrote summershum, it's a simple database storing for each
> file in each package:
> - the sha1sum of the file
> - the md5sum of the tarball

It might be helpful to store multiple hashsums per item, e.g. md5, sha1, sha-256 and sha-512, allowing the data to be easily cross-referenced with other data sources that might not use the same hash algorithm as summershum.

Regards
Till
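Storing several hashsums per file does not require reading each file several times. As a rough sketch (the helper name `multi_hash` and its parameters are made up for this example and are not part of summershum), Python's `hashlib` can feed every chunk to all hashers in a single pass:

```python
import hashlib


def multi_hash(fileobj, algorithms=("md5", "sha1", "sha256", "sha512"),
               chunk_size=65536):
    """Read a binary file object once and return {algorithm: hexdigest}."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    # iter() with a sentinel stops when read() returns an empty bytes object
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        for hasher in hashers.values():
            hasher.update(chunk)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}
```

The resulting dictionary maps naturally onto one database column per algorithm, so a lookup can use whichever hash the other data source happens to publish. The cost is extra CPU time per file and wider rows in the database.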
Plan for tomorrow's Fedora Infrastructure meeting (2014-02-13)
The infrastructure team will be having its weekly meeting tomorrow, 2014-02-13 at 19:00 UTC in #fedora-meeting on the freenode network.

Suggested topics:

#topic New folks introductions and Apprentice tasks.

If any new folks want to give a quick one-line bio or any apprentices would like to ask general questions, they can do so in this part of the meeting. Don't be shy!

#topic Applications status / discussion

Check in on the status of our applications: pkgdb, fas, bodhi, koji, community, voting, tagger, packager, dpsearch, etc. Mention any new releases, bugs we need to work around, or other things to note.

#topic Sysadmin status / discussion

Here we talk about sysadmin-related happenings from the previous week, or things that are upcoming.

#topic Upcoming Tasks/Items

https://apps.fedoraproject.org/calendar/list/infrastructure/

#topic Open Floor

Submit your agenda items as tickets in the trac instance and send a note replying to this thread.

More info here: https://fedoraproject.org/wiki/Infrastructure/Meetings#Meetings

Thanks

kevin