On 06/07/2013 05:08 PM, Stephen John Smoogen wrote:
> On 7 June 2013 13:48, Matthew Miller <mat...@fedoraproject.org> wrote:
>
>> On Fri, Jun 07, 2013 at 01:31:36PM -0600, Stephen John Smoogen wrote:
>>> The easiest way I could see is just to get a better sampling method,
>>> which would be to have funding for a mirror which we then put into
>>> mirrormanager, and we know that this is a sampling versus a request
>>> info. (Basically we would see what packages are downloaded directly
>>> and then extend that sample from the amount of downloads to the
>>> 500,000 systems that check in via mirrormanager.) The problems
>>> involved are paying for systems, storage, and bandwidth for such
>>> items.
>>
>> Maybe one of the mirrors would be able to provide logs?
>
> Possibly. In the past mirror admins have not wanted to do so for many
> reasons (can't keep logs longer than 24 hours for policy reasons, can't
> hand over logs without a formal agreement and then with as much redacted
> as possible, "if we do it for X then we have to do it for everyone, so
> no thank you"). When I was at my university gig, it had to go up 4
> levels of management before I gave up at the sub-CIO level.
>
> I have tried looking at the top-level mirrors, but most of the data is
> swamped out by other sites mirroring and lots of people doing
> development work and pointing to repos directly. This led to some
> strange statistics where trying to pull out even most of the noise made
> various packages "stand out", until I realized they were pulled in for
> cross-compiles and such (or the site that likes to do partial mirrors
> every couple of hours but always pulls in the same 4 packages each time,
> even when it pulls in others).
> I am expecting that other mirrors are going to run into that, which
> means that the stuff a lot of sites could give out (just the URLs per
> day, versus IP address plus URL) would have a lot of weird noise that
> makes, say, zvbi show up high, because it is both getting mirrored as
> the last package on the server and also because 8 packages use it as a
> dependency (not true, but I can't remember the package that showed up a
> ton).
>
> In either case, it is what got me to realize that a mirror is needed to
> allow for better statistics of this sort, because the data can be
> cleaned as needed versus pre-cleaned and reanimated.
Compelling information, thanks. I might still want to pursue improving the
data collection across an existing mirror network, but for now I like your
idea of inserting a tracking-mirror into the system.

I've been doing a lot of thinking lately about mirroring, logs, and
anonymity, because I think we want to get more data about EPEL usage
without raising privacy or other legal concerns. My impetus is simple:
EPEL is an enormously important and popular part of the Fedora Project to
all of us, and my job is helping make such projects wildly successful. :)
To figure out what wild success means and track our progress, we need a
better handle on usage.

A tracking-mirror could go something like this:

* Logs are rotated out to the trash regularly, e.g. every 24 hours.[1]

* Data is gathered from the logs in real time in an anonymous fashion, so
  nothing non-anonymous is inserted into the database. No connection is
  retained between the data in the database and the logs not yet thrown
  away.

* The log data gathering process attempts to cleanse in real time before
  writing to the database. (This aligns with your idea, yes?)

* Work closely with the cleansing tools for a period of time to get a
  handle on the sort of confusions you've experienced; see if programmatic
  predictions can help keep watch in the future (e.g. alert on unusual
  spikes in traffic to a small package set with certain patterns, such as
  near-each-other-alphabetically or used-together-often-as-dependencies).

* We use statistical analysis to extrapolate wider conclusions.

* We make it possible to grow this tracking-mirror network within the
  existing mirror network to improve the dataset.
* Throughout, code and configurations are dealt with transparently, so it
  is clear to community members not only that a better quality of tracking
  is happening, but also what the results of that tracking are (the
  analysis itself, and actions taken from the analysis that benefit
  users), and that all the details are there showing the protection of
  privacy.

I'm interested in championing this idea to get the resources (server,
bandwidth, people, code, etc.) to make at least the initial mirror happen.
With the right plan, I could see getting things in place pretty quickly,
e.g. by September.

- Karsten

[1] We could consider sending logs directly to /dev/null after data
collection if we felt the data collection was sufficient. The main risk
there is in reducing the ability to troubleshoot. It's an interesting
thought exercise, at least, to find a way toward dropping non-anonymous
information without even a millisecond of retention: such as pulling
anonymous data into the dataset, then cleansing the stream toward privacy
before writing it to the log. For example, it might be sufficient for
troubleshooting to know the class C IP block but drop the specific IP
address.

--
Karsten 'quaid' Wade
http://TheOpenSourceWay.org
http://community.redhat.com
@quaid (identi.ca/twitter/IRC)
gpg: AD0E0C41
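P.S. To make the real-time cleansing idea concrete, here is a minimal
sketch of the sort of filter I have in mind. The log format, field
positions, and sample line are assumptions for illustration, not anything
from a real mirror's config: it parses one Apache-style access-log line,
drops the last octet of the IPv4 address so only the class C (/24) block
survives (per the footnote above), and keeps just the date and the
requested path.

```python
import re

# Hypothetical combined-log-format line from a mirror's httpd access log.
# Only the fields we keep (truncated IP block, date, requested path) are
# parsed; everything else is discarded before anything reaches a database.
LOG_RE = re.compile(
    r'^(\d+\.\d+\.\d+)\.\d+ \S+ \S+ \[([^\]:]+)[^\]]*\] "GET (\S+)'
)

def anonymize(line):
    """Return (class_c_block, date, path), or None for non-matching lines.

    The final octet of the IPv4 address is dropped, so only the /24 block
    survives: enough for rough troubleshooting without retaining the
    specific client address.
    """
    m = LOG_RE.match(line)
    if m is None:
        return None
    block, date, path = m.groups()
    return (block + ".0/24", date, path)

# Example (hypothetical request for an EPEL package):
line = ('203.0.113.42 - - [07/Jun/2013:17:08:01 -0400] '
        '"GET /epel/6/x86_64/zvbi-0.2.33-5.el6.x86_64.rpm HTTP/1.1" 200 1234')
print(anonymize(line))
# -> ('203.0.113.0/24', '07/Jun/2013', '/epel/6/x86_64/zvbi-0.2.33-5.el6.x86_64.rpm')
```

Run as a stream filter between the web server and the collector, nothing
non-anonymous ever needs to be written to disk at all.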
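P.P.S. The alert-on-unusual-spikes bullet could start as simply as
comparing each package's daily count against its own recent history. The
threshold and the sample numbers below are invented purely for
illustration; this is a first-cut sketch, not a proposal for the actual
statistics:

```python
from statistics import mean, stdev

def spiking(history, today, threshold=3.0):
    """True if today's count is an unusual spike versus recent history.

    history: list of daily download counts for one package (>= 2 days).
    A count more than `threshold` standard deviations above the mean is
    flagged; the 3.0 default is an arbitrary starting point to tune.
    """
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today > mu  # flat history: any increase stands out
    return (today - mu) / sigma > threshold

# Invented example: a quiet package like zvbi suddenly re-pulled by a
# partial mirror, versus a popular package with normal day-to-day jitter.
history = {"zvbi": [12, 9, 11, 10, 13, 11, 10],
           "bash": [900, 880, 910, 905, 890, 915, 902]}
today = {"zvbi": 240, "bash": 930}
alerts = [pkg for pkg in history if spiking(history[pkg], today[pkg])]
print(alerts)
# -> ['zvbi']
```

A flagged package is exactly the kind of thing a human should then
eyeball, as you did, to decide whether it is a real usage signal or
mirror-traffic noise.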
_______________________________________________
epel-devel mailing list
epel-devel@lists.fedoraproject.org
https://admin.fedoraproject.org/mailman/listinfo/epel-devel