[Analytics] Metrics about the external use of the Wikimedia APIs

2015-06-10 Thread Quim Gil
I have been asking this question informally for too long, so here goes the formal request: Metrics about the external use of the Wikimedia APIs https://phabricator.wikimedia.org/T102079 We need them and, in fact, an outsider would be very surprised by the fact that we don't have them today and we

[Analytics] Fwd: Some data on apps and web

2015-06-10 Thread Adam Baso
Cross-posting to analytics. Props to Vibha for asking for the data. -- Forwarded message -- From: *Adam Baso* Date: Wednesday, June 10, 2015 Subject: Some data on apps and web To: mobile-l Hi all, thought I'd share some data from a few queries around apps uniques and apps + web

Re: [Analytics] Tracking unload events with EventLogging

2015-06-10 Thread Gergo Tisza
On Mon, Jun 8, 2015 at 3:57 PM, Erik Bernhardson wrote: > Searching around I saw some discussion about this almost a year ago, in > May 2014, before sendBeacon support was added (in Nov 2014), > titled "[Analytics] Using EventLogging for funnel analysis". There it was > proposed to push the event

Re: [Analytics] "If it didn't happen in HDFS, it didn't happen"

2015-06-10 Thread Oliver Keyes
Probably, on the Discovery team mailing list. On 10 June 2015 at 14:56, Pine W wrote: > Question about "the budget this year has ensured, at least for Discovery, > that ops and hardware support are slashed to the bone." I'm trying to figure > out the paradox of hiring more people for Discovery at

Re: [Analytics] "If it didn't happen in HDFS, it didn't happen"

2015-06-10 Thread Pine W
Question about "the budget this year has ensured, at least for Discovery, that ops and hardware support are slashed to the bone." I'm trying to figure out the paradox of hiring more people for Discovery at the same time that ops and hardware support are reduced. Can someone explain? Thanks, Pine

Re: [Analytics] "If it didn't happen in HDFS, it didn't happen"

2015-06-10 Thread Oliver Keyes
At the moment I don't have specific questions because we're trying to just get the thing set up. But, wider context and a prediction: The budget this year has ensured, at least for Discovery, that ops and hardware support are slashed to the bone. Because of this we're deploying bigger and bigger t

Re: [Analytics] "If it didn't happen in HDFS, it didn't happen"

2015-06-10 Thread Dan Andreescu
I think this thread is a bit too vague. If piwik is woefully inadequate, then what kind of analysis is needed for the use cases you're talking about? It doesn't seem obvious that we need endlessly scalable systems like Hadoop to analyze data gathered by small and fairly limited virtual machines.

Re: [Analytics] "If it didn't happen in HDFS, it didn't happen"

2015-06-10 Thread Oliver Keyes
On 10 June 2015 at 12:00, Andrew Otto wrote: > HmMmm. > > There’s no reason we couldn’t maintain beta level Kafka + Hadoop clusters in > labs. We probably should! I don’t really want to maintain them myself, but > they should be pretty easy to set up using hiera now. I could maintain them > if n

Re: [Analytics] "If it didn't happen in HDFS, it didn't happen"

2015-06-10 Thread Oliver Keyes
On 10 June 2015 at 11:35, Dan Andreescu wrote: > > > On Wed, Jun 10, 2015 at 11:02 AM, Oliver Keyes wrote: >> >> On 10 June 2015 at 10:53, Dan Andreescu wrote: >> > I see three ways for data to get into the cluster: >> > >> > 1. request stream, handled already, we're working on ways to pump the

Re: [Analytics] "If it didn't happen in HDFS, it didn't happen"

2015-06-10 Thread Andrew Otto
HmMmm. There’s no reason we couldn’t maintain beta level Kafka + Hadoop clusters in labs. We probably should! I don’t really want to maintain them myself, but they should be pretty easy to set up using hiera now. I could maintain them if no one else wants to. Thought two: > "so > when does n

Re: [Analytics] "Maybe Analytics" project in Phabricator

2015-06-10 Thread Andre Klapper
On Mon, 2015-04-27 at 11:28 -0700, Dan Andreescu wrote: > Sounds to me like the nuance we were trying to go for is causing > confusion. This is unintended and my opinion is that we should > remove maybe-analytics and just tell everyone to use blocked-on-analytics > as liberally as they wish. I

Re: [Analytics] "If it didn't happen in HDFS, it didn't happen"

2015-06-10 Thread Dan Andreescu
On Wed, Jun 10, 2015 at 11:02 AM, Oliver Keyes wrote: > On 10 June 2015 at 10:53, Dan Andreescu wrote: > > I see three ways for data to get into the cluster: > > > > 1. request stream, handled already, we're working on ways to pump the > data > > back out through APIs > > Awesome, and it'd end u

Re: [Analytics] "If it didn't happen in HDFS, it didn't happen"

2015-06-10 Thread Oliver Keyes
On 10 June 2015 at 10:53, Dan Andreescu wrote: > I see three ways for data to get into the cluster: > > 1. request stream, handled already, we're working on ways to pump the data > back out through APIs Awesome, and it'd end up in the Hadoop cluster in a table? How...do we kick that off most easi

Re: [Analytics] "If it didn't happen in HDFS, it didn't happen"

2015-06-10 Thread Dan Andreescu
I see three ways for data to get into the cluster: 1. request stream, handled already, we're working on ways to pump the data back out through APIs 2. Event Logging. We're making this scale arbitrarily by moving it to Kafka. Once that's done, we should be able to instrument pretty much anything

[Analytics] "If it didn't happen in HDFS, it didn't happen"

2015-06-10 Thread Oliver Keyes
Hey all, We're building a lot of tools out on Labs. From a RESTful API to a Wikidata Query Service, we're making neat things and Labs is proving the perfect place to prototype them - in all-but-one-respects. A crucial part of these tools being not just useful but measurably useful is the logs bei

Re: [Analytics] [Technical] Pick storage for pageview cubes

2015-06-10 Thread Marcel Ruiz Forns
If we are going to completely denormalize the data sets for anonymization, and we expect just slice and dice queries to the database, I think we wouldn't take much advantage of a relational DB, because it wouldn't need to aggregate values, slice or dice, all slices and dices would be precomputed, r