Rob, I'm not completely sure whether you are talking about the same logging infrastructure that feeds our traffic stats at stats.grok.se [1]. However, having worked with those stats and the raw files provided by Domas [2], I am pretty sure that those squid traffic stats are intended to be a complete traffic sample (or nearly so), not a 1/1000 sample.
We have done various fractionated samples in the past, but I believe the squid logs used for the traffic stats at present are not fractionated. If you are talking about a different logging process not associated with the traffic logs, then I apologize for my confusion.

-Robert Rohde

[1] http://stats.grok.se/
[2] http://dammit.lt/wikistats/

On Mon, Aug 9, 2010 at 10:16 PM, Rob Lanphier <ro...@wikimedia.org> wrote:
> Hi everyone,
>
> We're in the process of figuring out how we fix some of the issues in
> our logging infrastructure. I'm sending this email both to get the
> more knowledgeable folks to chime in about where I've got the details
> wrong, and for general comment on how we're doing our logging. We may
> need to recruit contract developers to work on this stuff, so we want
> to make sure we have clear and accurate information available, and we
> need to figure out what exactly we want to direct those people to do.
>
> We have a single collection point for all of our logging, which is
> actually just a sampling of the overall traffic (designed to be
> roughly one out of every 1000 hits). The process is described here:
> http://wikitech.wikimedia.org/view/Squid_logging
>
> My understanding is that this code is also involved somewhere:
> http://svn.wikimedia.org/viewvc/mediawiki/trunk/webstatscollector/
> ...but I'm a little unclear what the relationship is between that
> code and the code in trunk/udplog.
>
> At any rate, there are a couple of problems with the way it works:
> 1. Once we saturate the NIC on the logging machine, the quality of
> our sampling degrades pretty rapidly. We've generally had a problem
> with that over the past few months.
> 2. We'd like to increase the granularity of logging so that we can
> do more sophisticated analysis.
> For example, if we decide to run a test banner to a limited
> audience, we need to make sure we're getting more complete logs for
> that audience, or else we're not getting enough data to do any
> useful analysis.
>
> If this were your typical commercial operation, the answer would be
> "why aren't you just logging into Streambase?" (or some other data
> warehousing storage solution). I'm not suggesting that we do that
> (or even look at any of the solutions that bill themselves as open
> source alternatives), since, while our needs are increasing, we
> still aren't planning to be anywhere near as sophisticated as a lot
> of data tracking orgs. Still, it's worth asking questions about our
> existing setup. Should we be looking to optimize our existing
> single-box setup, extending our software to have multi-node
> collection, or looking at a whole new collection strategy?
>
> Rob
>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l
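[Editor's note: for readers unfamiliar with the 1-in-1000 sampling the quoted email describes, here is a minimal, hypothetical Python sketch of a deterministic 1-in-N line filter. This is not the actual udp2log or webstatscollector code, just an illustration of the general technique; the function name and stdin-based usage are assumptions for the example.]

```python
import sys

def sample_lines(lines, n=1000):
    """Yield every n-th line from a stream of log lines.

    Deterministic 1-in-n sampling: a simple counter rather than a
    random draw, so the output rate is exactly 1/n of the input rate.
    """
    for i, line in enumerate(lines, start=1):
        if i % n == 0:
            yield line

if __name__ == "__main__":
    # Filter log lines arriving on stdin, e.g. piped from a log relay,
    # and write only the sampled lines to stdout.
    for line in sample_lines(sys.stdin, n=1000):
        sys.stdout.write(line)
```

With this scheme, 10,000 input lines yield exactly 10 sampled lines; the trade-off the email points at is that any such fixed-rate sample can be too sparse for a small audience segment, such as viewers of a test banner.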