On Apr 30, 2014, at 8:40 AM, Sean Pringle <sprin...@wikimedia.org> wrote:

> On Wed, Apr 30, 2014 at 12:44 PM, Oliver Keyes <oke...@wikimedia.org> wrote:
> Okay, so, have tested (to a limited degree. The work I'm doing that involves 
> the dbs involves eventlogging, so this is mostly me making up excuses to run 
> queries). Thoughts:
> 
> *We should probably put in some kind of restrictions around what we care 
> about. For example, I see the tables relating to the Wikimania and Arbcom 
> wikis in there. This is not data I think we're ever going to care about, but 
> it is data, which means we'll either have to write really complex UNIONs to 
> gather global data, with a constantly-maintained list of 
> dbs-we-don't-care-about, or accept inaccuracies in our data. My suggestion 
> would be for these dbs to be removed and excluded from replication, using the 
> noc dblists to identify the ones we don't care about; generally 
> "deleted", "closed", "special", "wikimedia" wikis aren't things we want to be 
> running queries over.
> 
> If there are wikis you guys know for sure nobody using the 'research' user 
> will ever want, then they can simply be hidden by modifying the account grants.

Oliver, I am not sure how we define “data we’re [n]ever going to care about”. I 
do expect we will receive occasional requests for data related to closed or 
special wikis (see https://office.wikimedia.org/wiki/File:Officewiki_ae.png 
just to mention a recent example).

The point about global queries is well taken, but I think it should be handled 
differently (see below). Since we’re not talking about privacy here (uncensored 
data can be obtained by anyone with access to the production DBs), but 
usability, I’d avoid making assumptions about which wikis should *always* be 
excluded. We should have an equivalent of the API’s sitematrix with project 
metadata to allow flexible filtering.
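
To illustrate the kind of filtering I have in mind, here is a minimal sketch. The field names (family, closed, private), the sample entries and the select_wikis helper are all made up for the example; they are not the actual sitematrix schema or any existing tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WikiInfo:
    """Hypothetical per-wiki metadata record (not the real sitematrix schema)."""
    dbname: str
    family: str            # e.g. "wikipedia", "special"
    closed: bool = False
    private: bool = False


# Illustrative sample entries only.
WIKIS = [
    WikiInfo("enwiki", "wikipedia"),
    WikiInfo("dewiki", "wikipedia"),
    WikiInfo("examplewikimania", "special"),
    WikiInfo("examplearbcom", "special", private=True),
    WikiInfo("exampleclosed", "wikipedia", closed=True),
]


def select_wikis(wikis, families=None, include_closed=False, include_private=False):
    """Return dbnames matching the filter; nothing is excluded permanently."""
    selected = []
    for wiki in wikis:
        if families is not None and wiki.family not in families:
            continue
        if wiki.closed and not include_closed:
            continue
        if wiki.private and not include_private:
            continue
        selected.append(wiki.dbname)
    return selected


# Default "global" query set: open, public Wikipedias.
open_wikipedias = select_wikis(WIKIS, families={"wikipedia"})

# An occasional request about closed or special wikis is just another filter.
everything = select_wikis(WIKIS, include_closed=True, include_private=True)
```

The point is that the exclusion list lives in the query-time filter rather than in replication, so a one-off request about a closed or special wiki does not need a schema change.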

> *This is probably my bad, but I understood the goal to be having a single db 
> containing unified, core tables. So, we'd have one db, with one revision 
> table, that'd have an extra column of "wiki" that denoted the project the 
> entry referred to. This would let us perform global queries without the 
> complex UNIONs mentioned above. Is this still the goal, or...?
> 
> No, that wasn't the goal. Sorry if there was miscommunication. The actual 
> data will remain in separate wikis using regular replication.
> 
> However, it's quite possible to create one or more unified databases with 
> (for example) SQL VIEWs that union all tables from a set of pre-defined 
> wikis, with 'wiki' columns, just as you describe. Same thing, really. We 
> could even allow ad-hoc creation of unified views for whatever .dblist is 
> appropriate for the project. I don't think anything need be ruled out yet -- 
> that's the whole point of SQL, right? Slow, but flexible. :-)

That would work. Oliver is right that creating views for core tables in 
pre-defined wikis (say, all wikipedias) would be valuable. Sean, how about we 
create a page on wikitech with requirements for these views and we take it from 
there?
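
As a starting point for that page, here is a minimal sketch of the unified-view idea. It uses Python's built-in sqlite3 with attached in-memory databases standing in for the per-wiki databases, purely so that it runs anywhere; on the real replicas this would be a MySQL CREATE VIEW in a dedicated database. The view name all_revision, the union_view_sql helper, the two-column revision table and the sample rows are assumptions made for the example.

```python
import sqlite3


def union_view_sql(view_name, table, columns, wikis):
    """Build a CREATE VIEW statement that unions one table across wikis.

    Identifiers are interpolated directly, so `wikis` must come from a
    trusted source such as a .dblist, never from user input.
    """
    cols = ", ".join(columns)
    selects = [
        f"SELECT '{wiki}' AS wiki, {cols} FROM {wiki}.{table}" for wiki in wikis
    ]
    # TEMP is SQLite-specific: only temp views may span attached databases.
    return f"CREATE TEMP VIEW {view_name} AS\n" + "\nUNION ALL\n".join(selects)


# Autocommit mode, so ATTACH never runs inside an open transaction.
conn = sqlite3.connect(":memory:", isolation_level=None)

wikis = ["enwiki", "dewiki"]  # stand-in for the contents of a .dblist
for wiki in wikis:
    conn.execute(f"ATTACH DATABASE ':memory:' AS {wiki}")
    conn.execute(
        f"CREATE TABLE {wiki}.revision (rev_id INTEGER, rev_user_text TEXT)"
    )

# Made-up sample rows.
conn.execute("INSERT INTO enwiki.revision VALUES (1, 'Alice'), (2, 'Bob')")
conn.execute("INSERT INTO dewiki.revision VALUES (1, 'Carla')")

conn.execute(
    union_view_sql("all_revision", "revision", ["rev_id", "rev_user_text"], wikis)
)

# A "global" query no longer needs a hand-written UNION.
rows = conn.execute(
    "SELECT wiki, COUNT(*) FROM all_revision GROUP BY wiki ORDER BY wiki"
).fetchall()
```

Because the statement is generated from a list of wikis, the same helper could produce a different view for each .dblist, which is the ad-hoc case Sean mentions.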

Dario
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
