Announcing summershum

2014-02-12 Thread Pierre-Yves Chibon
Good morning everyone,

As you know, Ralph and I went to DevConf last weekend, and of course, what
happens when you put two hackers in the same room? Well, they go crazy and start
hacking... The result of this is summershum.

The idea originates from a discussion between Mickael Scherrer, Ralph and me on
Friday evening. Could we track all the files in every package in the
distribution?

Ideally, this would allow us to investigate questions like:
 - How many copies of the GPL license are shipped?
 - How many GPL licenses still ship the old FSF address?
 - How many copies of jquery or md5.c?
 - How many files changed between two releases?

So Ralph and I wrote summershum: a simple database that stores, for each file in
each package:
 - the package name
 - the filename
 - the sha1sum of the file
 - the tarball name
 - the md5sum of the tarball
 - a creation date
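
As a rough illustration of the record listed above, here is a minimal sketch
using Python's standard sqlite3 module. The table name, the column names and the
sample rows are assumptions made for illustration; they are not taken from the
summershum code. The query shows how a question like "how many copies of this
file are shipped?" could be answered from such a table.

```python
import sqlite3

# Hypothetical schema: one row per file found in an uploaded tarball.
SCHEMA = """
CREATE TABLE files (
    pkg_name    TEXT NOT NULL,   -- the package name
    filename    TEXT NOT NULL,   -- path of the file inside the tarball
    sha1sum     TEXT NOT NULL,   -- SHA-1 of the file contents
    tarball     TEXT NOT NULL,   -- the tarball name
    tar_sum     TEXT NOT NULL,   -- md5sum of the tarball
    created_on  TEXT NOT NULL    -- creation date
)
"""


def copies_of(conn, sha1sum):
    """Count how many stored files share the given SHA-1."""
    row = conn.execute(
        "SELECT COUNT(*) FROM files WHERE sha1sum = ?", (sha1sum,)
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
# Made-up sample rows, for illustration only.
rows = [
    ("foo", "foo-1.0/COPYING", "aaa111", "foo-1.0.tar.gz", "t1", "2014-02-12"),
    ("bar", "bar-2.0/LICENSE", "aaa111", "bar-2.0.tar.gz", "t2", "2014-02-12"),
    ("bar", "bar-2.0/main.c", "bbb222", "bar-2.0.tar.gz", "t2", "2014-02-12"),
]
conn.executemany("INSERT INTO files VALUES (?, ?, ?, ?, ?, ?)", rows)
print(copies_of(conn, "aaa111"))  # the two license files share one checksum
```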

Next to the database is a fedmsg consumer that, for each new upload to the
lookaside cache, downloads the new tarball, extracts it and fills the database
with the sha1sum of every file found.
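
The hashing step of that consumer can be sketched as follows. This is only an
illustration under stated assumptions: the function names are hypothetical, the
fedmsg and download parts are left out, and the sketch reads members straight
from the archive instead of extracting them to disk as the consumer does. A tiny
in-memory tarball stands in for a real upload so the example runs on its own.

```python
import hashlib
import io
import tarfile


def sha1_of_members(tar):
    """Yield (member name, SHA-1 hex digest) for each regular file."""
    for member in tar.getmembers():
        if not member.isfile():
            continue  # skip directories, symlinks, devices, ...
        fileobj = tar.extractfile(member)
        digest = hashlib.sha1()
        # Read in chunks so large files are never held in memory at once.
        for chunk in iter(lambda: fileobj.read(65536), b""):
            digest.update(chunk)
        yield member.name, digest.hexdigest()


def make_demo_tarball():
    """Build a one-file tarball in memory (stand-in for a downloaded one)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        data = b"hello\n"
        info = tarfile.TarInfo(name="demo-1.0/README")
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


with tarfile.open(fileobj=make_demo_tarball(), mode="r:*") as tar:
    results = list(sha1_of_members(tar))
print(results)
```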

There is an RFE open on the project to store the same information for the
binary RPMs themselves. This would run for each successful build in koji.

The project is currently at: https://github.com/ralphbean/summershum
It comes with summershum-cli, which uses datagrepper to retrieve the recent
uploads to the lookaside cache and load them into the database.

I think the current state is good enough to start deploying it, but we wanted to
announce and discuss it before taking any further action.


So, what do you think?


Cheers,
Pierre


___
infrastructure mailing list
infrastructure@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/infrastructure

Re: Announcing summershum

2014-02-12 Thread Vít Ondruch
On 12.2.2014 09:46, Pierre-Yves Chibon wrote:
> Good morning everyone,
>
> As you know Ralph and I went to DevConf last week-end, and of course, what
> happens when you put two hackers in the same room? Well they go crazy and start
> hacking... The result of this is summershum.
>
> The idea originates from a discussion between Mickael Scherrer, Ralph and I on
> Friday evening. Could we track all the files in every packages in the
> distribution?
>
> Ideally, this would allow us to investigate questions like:
>  - How many copies of the GPL license are shipped?
>  - How many GPL license still ship the old FSF address?
>  - How many copies of jquery or md5.c?
>  - How many files changed between two releases?
>
> So Ralph and I wrote summershum, it's a simple database storing for each file in
> each package:
>  - the packages name
>  - the filename
>  - the sha1sum of the file
>  - the tarball name
>  - the md5sum of the tarball

I don't think we should use md5sum. It is disabled by default in recent
OpenSSL if I am not mistaken.


Vít

>  - a creation date
>
> Next to the database is a fedmsg consumer that for each new upload on the
> lookaside cache, download the new tarball, extracts it and fills the database
> with the sha1sum of every file found.
>
> There is a RFE opened on the project to store the same information for the
> binary/rpm themselves. This would work for each successful build on koji.
>
> The project is currently at: https://github.com/ralphbean/summershum
> It comes with a summershum-cli which fills the database using datagrepper to
> retrieve the recent uploads to the lookaside cache and load them in the
> database.
>
> I think the current state is good enough to start deploying it but we wanted to
> announce/discuss about it before taking any further action.
>
>
> So, what do you think?
>
>
> Cheers,
> Pierre
>
>


Re: Announcing summershum

2014-02-12 Thread Pierre-Yves Chibon
On Wed, Feb 12, 2014 at 11:58:15AM +0100, Vít Ondruch wrote:
>On 12.2.2014 09:46, Pierre-Yves Chibon wrote:
>  So Ralph and I wrote summershum, it's a simple database storing for each file in
>  each package:
>   - the packages name
>   - the filename
>   - the sha1sum of the file
>   - the tarball name
>   - the md5sum of the tarball
> 
>I don't think we should use md5sum. It is disabled by default in recent
>OpenSSL if I am not mistaken.

That's what we use in the lookaside cache (the `sources` file in your git), so in
fact I just take this information from the fedmsg message. But that's one reason
why we use sha1sum for the files.


Pierre

Re: Announcing summershum

2014-02-12 Thread Matthew Miller
On Wed, Feb 12, 2014 at 09:46:27AM +0100, Pierre-Yves Chibon wrote:
> So Ralph and I wrote summershum, it's a simple database storing for each file in
> each package:
>  - the packages name
>  - the filename
>  - the sha1sum of the file
>  - the tarball name
>  - the md5sum of the tarball
>  - a creation date

Neat!

So, I have one small suggestion and one possibly too ambitious one.

The small one is that it might be nice to include the output of `file` on
each file. 

The large one is: while we're going through all of the files, could we index
them for full-text searching? Possibly using https://github.com/Debian/dcs,
or (in my dreams) something that returns the results broken down by current
repo and with options for a summary list by package and file (csv, say).

-- 
Matthew Miller -- Fedora Project

md5 vs sha256 in dist-git sources

2014-02-12 Thread Vít Ondruch
On 12.2.2014 12:15, Pierre-Yves Chibon wrote:
> On Wed, Feb 12, 2014 at 11:58:15AM +0100, Vít Ondruch wrote:
>>On 12.2.2014 09:46, Pierre-Yves Chibon wrote:
>>  So Ralph and I wrote summershum, it's a simple database storing for each file in
>>  each package:
>>   - the packages name
>>   - the filename
>>   - the sha1sum of the file
>>   - the tarball name
>>   - the md5sum of the tarball
>>
>>I don't think we should use md5sum. It is disabled by default in recent
>>OpenSSL if I am not mistaken.
> That's what we use in the lookaside cache (the source file in your git)

Interesting, since the review guidelines [1] say this:

*MUST*: The sources used to build the package must match the upstream
source, as provided in the spec URL. Reviewers should use sha256sum for
this task as it is used by the `sources` file once imported into git.

But checking some of my packages, you are right that the "sources" file
has md5 hashes. Maybe somebody could look into this as well.


Vít



[1] http://fedoraproject.org/wiki/Packaging:ReviewGuidelines


Re: Announcing summershum

2014-02-12 Thread Vít Ondruch
On 12.2.2014 12:44, Matthew Miller wrote:
> On Wed, Feb 12, 2014 at 09:46:27AM +0100, Pierre-Yves Chibon wrote:
>> So Ralph and I wrote summershum, it's a simple database storing for each file in
>> each package:
>>  - the packages name
>>  - the filename
>>  - the sha1sum of the file
>>  - the tarball name
>>  - the md5sum of the tarball
>>  - a creation date
> Neat!
>
> So, I have one small suggestion and one possibly-too ambitious one. 
>
> The small one is that it might be nice to include the output of `file` on
> each file. 

The `file` output would need to be stored together with the version of `file`,
since its output is not stable in my experience.
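
A minimal sketch of recording both pieces of information, assuming the `file`
utility is on the PATH. The function name `describe` is hypothetical and is not
part of summershum. It relies only on `file --version` and on the `-b` (brief)
flag used elsewhere in this thread, and returns None when `file` is missing.

```python
import shutil
import subprocess


def describe(path):
    """Return (version line of `file`, brief `file` output) for path.

    Returns None when the `file` utility is not installed.
    """
    if shutil.which("file") is None:
        return None
    version_out = subprocess.run(
        ["file", "--version"], capture_output=True, text=True
    ).stdout.splitlines()
    # The first line of `file --version` names the release, e.g. "file-5.xx".
    version = version_out[0] if version_out else "unknown"
    kind = subprocess.run(
        ["file", "-b", path], capture_output=True, text=True
    ).stdout.strip()
    return version, kind
```

Storing the version string next to each description would let later queries
separate results produced by different releases of `file`.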

Vít


>
> The large one is: while we're going through all of the files, could we index
> them for full-text searching? Possibly using https://github.com/Debian/dcs,
> or (in my dreams) something that returns the results broken down by current
> repo and with options for a summary list by package and file (csv, say).
>


Re: Announcing summershum

2014-02-12 Thread Pierre-Yves Chibon
On Wed, Feb 12, 2014 at 01:47:08PM +0100, Vít Ondruch wrote:
> On 12.2.2014 12:44, Matthew Miller wrote:
> > On Wed, Feb 12, 2014 at 09:46:27AM +0100, Pierre-Yves Chibon wrote:
> >> So Ralph and I wrote summershum, it's a simple database storing for each file in
> >> each package:
> >>  - the packages name
> >>  - the filename
> >>  - the sha1sum of the file
> >>  - the tarball name
> >>  - the md5sum of the tarball
> >>  - a creation date
> > Neat!
> >
> > So, I have one small suggestion and one possibly-too ambitious one. 
> >
> > The small one is that it might be nice to include the output of `file` on
> > each file. 

Technically doable but I'm curious, what use-case do you have in mind?

Plus I've run into:
$ file /usr/share/doc/python-magic-5.04/example.py -b
ASCII Java program text

Seems legit, right? :)

> The `file` output would need to be store with `file` version, since its
> output is not stable in my experience.

That's not a problem: we store the tarball name, which does contain the version.


Pierre

Re: Announcing summershum

2014-02-12 Thread Matthew Miller
On Wed, Feb 12, 2014 at 01:47:08PM +0100, Vít Ondruch wrote:
> > The small one is that it might be nice to include the output of `file` on
> > each file. 
> The `file` output would need to be store with `file` version, since its
> output is not stable in my experience.


Yeah, that's kind of unfortunate. The man page says:

 The type printed will usually contain one of the words _text_ (the
 file contains only printing characters and a few common control
 characters and is probably safe to read on an ASCII terminal),
 _executable_ (the file contains the result of compiling a program
 in a form understandable to some UNIX kernel or another), or
 _data_ meaning anything else (data is usually “binary” or
 non-printable). Exceptions are well-known file formats (core
 files, tar archives) that are known to contain binary data. When
 modifying magic files or the program itself, make sure to preserve
 these keywords. Users depend on knowing that all the readable
 files in a directory have the word “text” printed. Don't do as
 Berkeley did and change “shell commands text” to “shell script”.

I'm not sure how stable the "well-known file formats" list is.

Maybe we could pick to standardize on whatever version is in RHEL7 and keep
that until RHEL7 EOL? That should give us a nice long timeframe.
Alternately, we could accept that the strings might change slightly and not
worry too much. We might need to special-case some things, too -- by
default, it prints the target of symbolic links, for example.


-- 
Matthew Miller -- Fedora Project

Re: Announcing summershum

2014-02-12 Thread Matthew Miller
On Wed, Feb 12, 2014 at 02:31:28PM +0100, Pierre-Yves Chibon wrote:
> > > The small one is that it might be nice to include the output of `file` on
> > > each file. 
> Technically doable but I'm curious, what use-case do you have in mind?

Looking for binaries, blobs, and archives that have crept into the source.

Also, pretty charts. :)

> Plus I've ran into:
> $ file /usr/share/doc/python-magic-5.04/example.py -b
> ASCII Java program text
> Seems legit, right? :)

Comes out "Python script, ASCII text executable" on Rawhide



-- 
Matthew Miller -- Fedora Project

Re: Announcing summershum

2014-02-12 Thread Matthew Miller
On Wed, Feb 12, 2014 at 08:43:18AM -0500, Matthew Miller wrote:
> > Technically doable but I'm curious, what use-case do you have in mind?
> Looking for binaries, blobs, and archives that have crept in to the source.
> Also, pretty charts. :)

Also, if combined with the full-text search, one could say "search in C and
C++ source files" when looking for a known vulnerability.


-- 
Matthew Miller -- Fedora Project

Re: Announcing summershum

2014-02-12 Thread Pierre-Yves Chibon
On Wed, Feb 12, 2014 at 08:43:18AM -0500, Matthew Miller wrote:
> On Wed, Feb 12, 2014 at 02:31:28PM +0100, Pierre-Yves Chibon wrote:
> > > > The small one is that it might be nice to include the output of `file` on
> > > > each file. 
> > Technically doable but I'm curious, what use-case do you have in mind?
> 
> Looking for binaries, blobs, and archives that have crept in to the source.

Cool, good idea :)

> Also, pretty charts. :)

Dang, you found my one weakness! Arg!!

> > Plus I've ran into:
> > $ file /usr/share/doc/python-magic-5.04/example.py -b
> > ASCII Java program text
> > Seems legit, right? :)
> 
> Comes out "Python script, ASCII text executable" on Rawhide

Indeed, that was on an EL6 machine; I was checking whether that library was
available on EL6 (the OS of our infra).

Pierre

Re: md5 vs sha256 in dist-git sources

2014-02-12 Thread Mathieu Bridon
On Wed, 2014-02-12 at 13:44 +0100, Vít Ondruch wrote:
> On 12.2.2014 12:15, Pierre-Yves Chibon wrote:
>
> > On Wed, Feb 12, 2014 at 11:58:15AM +0100, Vít Ondruch wrote:
> > >On 12.2.2014 09:46, Pierre-Yves Chibon wrote:
> > >  So Ralph and I wrote summershum, it's a simple database storing for each file in
> > >  each package:
> > >   - the packages name
> > >   - the filename
> > >   - the sha1sum of the file
> > >   - the tarball name
> > >   - the md5sum of the tarball
> > > 
> > >I don't think we should use md5sum. It is disabled by default in recent
> > >OpenSSL if I am not mistaken.
> > That's what we use in the lookaside cache (the source file in your git)
> 
> Interesting, since review guidelines [1] says this:
> 
> MUST: The sources used to build the package must match the upstream
> source, as provided in the spec URL. Reviewers should use sha256sum
> for this task as it is used by the sources file once imported into
> git.
> 
> But checking some of my packages, you are right that the "sources"
> file has md5 has. May be somebody could look into this as well.


Afaik, the hashing mechanism to use is defined in the fedpkg
configuration file:

https://git.fedorahosted.org/cgit/fedpkg.git/tree/src/fedpkg.conf

So theoretically, you could change it locally, and the sources you
upload would then have their sha256sum in the `sources` file.

But then, people downloading them with `fedpkg sources` (which includes the
Koji builders) would get an error message saying that the checksum does not
match.

So we would probably need to add a fallback mechanism in pyrpkg, so that
if sha256 verification fails, it then tries md5.
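
Such a fallback could look roughly like the sketch below. It is only an
illustration: the function names are hypothetical and this is not pyrpkg's
actual API. It tries each algorithm in order and reports which one matched, so
the caller can tell a sha256 match from an md5 one.

```python
import hashlib


def file_digest(path, algorithm):
    """Return the hex digest of the file at path for the named algorithm."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path, expected, algorithms=("sha256", "md5")):
    """Return the first algorithm whose digest matches expected, else None."""
    wanted = expected.lower()
    for algorithm in algorithms:
        if file_digest(path, algorithm) == wanted:
            return algorithm
    return None
```

A caller would treat a None result as a checksum mismatch, and could log a
warning whenever the match comes from md5 rather than sha256.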


-- 
Mathieu


Re: md5 vs sha256 in dist-git sources

2014-02-12 Thread Vít Ondruch
On 12.2.2014 15:29, Mathieu Bridon wrote:
> On Wed, 2014-02-12 at 13:44 +0100, Vít Ondruch wrote:
>> On 12.2.2014 12:15, Pierre-Yves Chibon wrote:
>>
>>> On Wed, Feb 12, 2014 at 11:58:15AM +0100, Vít Ondruch wrote:
>>>> On 12.2.2014 09:46, Pierre-Yves Chibon wrote:
>>>>  So Ralph and I wrote summershum, it's a simple database storing for each file in
>>>>  each package:
>>>>   - the packages name
>>>>   - the filename
>>>>   - the sha1sum of the file
>>>>   - the tarball name
>>>>   - the md5sum of the tarball
>>>>
>>>> I don't think we should use md5sum. It is disabled by default in recent
>>>> OpenSSL if I am not mistaken.
>>> That's what we use in the lookaside cache (the source file in your git)
>> Interesting, since review guidelines [1] says this:
>>
>> MUST: The sources used to build the package must match the upstream
>> source, as provided in the spec URL. Reviewers should use sha256sum
>> for this task as it is used by the sources file once imported into
>> git.
>>
>> But checking some of my packages, you are right that the "sources"
>> file has md5 has. May be somebody could look into this as well.
>
> Afaik, the hashing mechanism to use is defined in the fedpkg
> configuration file:
>
> https://git.fedorahosted.org/cgit/fedpkg.git/tree/src/fedpkg.conf
>
> So theoretically, you could change it locally, and the sources you
> upload would then have their sha256sum in the `sources` file.
>
> But then, people who would download them with `fedpkg sources` (that
> includes Koji builders) would receive error messages that the checksum
> does not match.
>
> So we would probably need to add a fallback mechanism in pyrpkg, so that
> if sha256 verification fails, then it would try md5.
>
>

Looks to be sub-optimal so to say :)


Vít

Re: Announcing summershum

2014-02-12 Thread Vít Ondruch
On 12.2.2014 14:31, Pierre-Yves Chibon wrote:
> On Wed, Feb 12, 2014 at 01:47:08PM +0100, Vít Ondruch wrote:
>> On 12.2.2014 12:44, Matthew Miller wrote:
>>> On Wed, Feb 12, 2014 at 09:46:27AM +0100, Pierre-Yves Chibon wrote:
>>>> So Ralph and I wrote summershum, it's a simple database storing for each file in
>>>> each package:
>>>>  - the packages name
>>>>  - the filename
>>>>  - the sha1sum of the file
>>>>  - the tarball name
>>>>  - the md5sum of the tarball
>>>>  - a creation date
>>> Neat!
>>>
>>> So, I have one small suggestion and one possibly-too ambitious one. 
>>>
>>> The small one is that it might be nice to include the output of `file` on
>>> each file. 
> Technically doable but I'm curious, what use-case do you have in mind?
>
> Plus I've ran into:
> $ file /usr/share/doc/python-magic-5.04/example.py -b
> ASCII Java program text
>
> Seems legit, right? :)
>
>> The `file` output would need to be store with `file` version, since its
>> output is not stable in my experience.
> That's not a problem, we stored the tarball name which does contain the version
>

I was referring to the version of the `file` utility. I know that it gives
different output for HTML files: sometimes they are classified as HTML, other
times as XHTML or XML. For Ruby files, it returns something like "ruby file"
or "ruby module", but the heuristic is unreliable.

See the email from Matt, who parsed my objection correctly ;)



Vít

Re: Announcing summershum

2014-02-12 Thread Jeff Sheltren
On Wed, Feb 12, 2014 at 12:46 AM, Pierre-Yves Chibon wrote:
>
>
> The idea originates from a discussion between Mickael Scherrer, Ralph and I on
> Friday evening. Could we track all the files in every packages in the
> distribution?
>
> Ideally, this would allow us to investigate questions like:
>  - How many copies of the GPL license are shipped?
>  - How many GPL license still ship the old FSF address?
>  - How many copies of jquery or md5.c?
>  - How many files changed between two releases?


Cool idea, and sounds a lot like what FOSSology could do for you already
(http://www.fossology.org/projects/fossology). Have you checked that out?

-Jeff

Re: md5 vs sha256 in dist-git sources

2014-02-12 Thread Kevin Fenzi
There's a releng ticket (formerly infra) about moving away from md5 for
this. 

https://fedorahosted.org/rel-eng/ticket/5846

I'm sure any offers of help would be welcome. :) 

kevin



Re: md5 vs sha256 in dist-git sources

2014-02-12 Thread Mathieu Bridon
On Wed, 2014-02-12 at 08:22 -0700, Kevin Fenzi wrote:
> There's a releng ticket (formerly infra) about moving away from md5 for
> this. 
> 
> https://fedorahosted.org/rel-eng/ticket/5846
> 
> I'm sure any offers of help would be welcome. :) 

Done.


-- 
Mathieu


Re: Announcing summershum

2014-02-12 Thread Till Maas
Hi,

this sounds like a good idea.

On Wed, Feb 12, 2014 at 09:46:27AM +0100, Pierre-Yves Chibon wrote:

> So Ralph and I wrote summershum, it's a simple database storing for each file in
> each package:

>  - the sha1sum of the file

>  - the md5sum of the tarball

It might be helpful to store multiple hashsums per item, e.g. md5, sha1,
sha256 and sha512, so that the data can easily be cross-referenced with
other data sources that might not use the same hash algorithm as
summershum.
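
Computing several digests does not require reading each file several times. A
minimal sketch, assuming Python's hashlib; the function name and the algorithm
list are illustrative and not taken from summershum:

```python
import hashlib
import io

ALGORITHMS = ("md5", "sha1", "sha256", "sha512")


def multi_hash(fileobj, algorithms=ALGORITHMS, chunk_size=65536):
    """Return {algorithm: hex digest}, reading fileobj only once."""
    digests = {name: hashlib.new(name) for name in algorithms}
    for chunk in iter(lambda: fileobj.read(chunk_size), b""):
        # Feed the same chunk to every hash object.
        for digest in digests.values():
            digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}


sums = multi_hash(io.BytesIO(b"hello\n"))
for name, value in sums.items():
    print(name, value)
```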

Regards
Till

Plan for tomorrow's Fedora Infrastructure meeting (2014-02-13)

2014-02-12 Thread Kevin Fenzi
The infrastructure team will be having its weekly meeting tomorrow,
2014-02-13 at 19:00 UTC in #fedora-meeting on the freenode network.

Suggested topics:

#topic New folks introductions and Apprentice tasks.

If any new folks want to give a quick one line bio or any apprentices
would like to ask general questions, they can do so in this part of the
meeting. Don't be shy!

#topic Applications status / discussion

Check in on the status of our applications: pkgdb, fas, bodhi, koji,
community, voting, tagger, packager, dpsearch, etc.
Note any new releases, bugs we need to work around, or other things worth
mentioning.

#topic Sysadmin status / discussion

Here we talk about sysadmin related happenings from the previous week,
or things that are upcoming. 

#topic Upcoming Tasks/Items 

https://apps.fedoraproject.org/calendar/list/infrastructure/

#topic Open Floor

Submit your agenda items as tickets in the trac instance and send a
note replying to this thread.

More info here:

https://fedoraproject.org/wiki/Infrastructure/Meetings#Meetings

Thanks

kevin

