Rob, I'm not completely sure whether you are talking about the same logging infrastructure that feeds our traffic stats at stats.grok.se [1]. However, having worked with those stats and the raw files provided by Domas [2], I am pretty sure that those squid traffic stats are intended to be a complete traffic sample (or nearly so), not a 1/1000 sample.
We have done various fractionated samples in the past, but I believe the squid logs used for the traffic stats at present are not fractionated. If you are talking about a different logging process not associated with the traffic logs, then I apologize for my confusion.

-Robert Rohde

[1] http://stats.grok.se/
[2] http://dammit.lt/wikistats/

On Mon, Aug 9, 2010 at 10:16 PM, Rob Lanphier <ro...@wikimedia.org> wrote:
> Hi everyone,
>
> We're in the process of figuring out how we fix some of the issues in
> our logging infrastructure. I'm sending this email both to get the
> more knowledgeable folks to chime in about where I've got the details
> wrong, and for general comment on how we're doing our logging. We may
> need to recruit contract developers to work on this stuff, so we want
> to make sure we have clear and accurate information available, and we
> need to figure out what exactly we want to direct those people to do.
>
> We have a single collection point for all of our logging, which is
> actually just a sampling of the overall traffic (designed to be
> roughly one out of every 1000 hits). The process is described here:
> http://wikitech.wikimedia.org/view/Squid_logging
>
> My understanding is that this code is also involved somewhere:
> http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/
> ...but I'm a little unclear what the relationship is between that
> code and the code in trunk/udplog.
>
> At any rate, there are a couple of problems with the way it works:
> 1. Once we saturate the NIC on the logging machine, the quality of
> our sampling degrades pretty rapidly. We've generally had a problem
> with that over the past few months.
> 2. We'd like to increase the granularity of logging so that we can
> do more sophisticated analysis.
> For example, if we decide to run a test banner to a limited
> audience, we need to make sure we're getting more complete logs for
> that audience, or else we're not getting enough data to do any
> useful analysis.
>
> If this were your typical commercial operation, the answer would be
> "why aren't you just logging into Streambase?" (or some other data
> warehousing storage solution). I'm not suggesting that we do that
> (or even look at any of the solutions that bill themselves as open
> source alternatives), since, while our needs are increasing, we
> still aren't planning to be anywhere near as sophisticated as a lot
> of data tracking orgs. Still, it's worth asking questions about our
> existing setup. Should we be looking to optimize our existing
> single-box setup, extending our software to have multi-node
> collection, or looking at a whole new collection strategy?
>
> Rob
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
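[Editor's note: for readers unfamiliar with the 1-in-1000 sampling the quoted email describes, here is a minimal, hypothetical Python sketch of a deterministic 1-in-N line filter. This is not the actual udp2log or webstatscollector code, just an illustration of the general technique; the function name and stdin-based usage are assumptions for the example.]

```python
import sys

def sample_lines(lines, n=1000):
    """Yield every n-th line from a stream of log lines.

    Deterministic 1-in-n sampling: a simple counter rather than a
    random draw, so the output rate is exactly 1/n of the input rate.
    """
    for i, line in enumerate(lines, start=1):
        if i % n == 0:
            yield line

if __name__ == "__main__":
    # Filter log lines arriving on stdin, e.g. piped from a log relay,
    # and write only the sampled lines to stdout.
    for line in sample_lines(sys.stdin, n=1000):
        sys.stdout.write(line)
```

With this scheme, 10,000 input lines yield exactly 10 sampled lines; the trade-off the email points at is that any such fixed-rate sample can be too sparse for a small audience segment, such as viewers of a test banner.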