Thanks all for the awesome comments :). Will get to them tomorrow morning![1]

[1] East coast time.
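
For anyone who wants to experiment with the proposed layout, here is a minimal sketch of the legacy-to-new conversion described in the quoted proposal. It is not the script mentioned there. The suffix table, the mobile/zero markers and the function names are assumptions made up for illustration; they follow the made-up Florence example rather than the real pagecounts notation table.

```python
from collections import defaultdict

# Hypothetical suffix table: mirrors the made-up example in the proposal
# ("de.v" -> "de.wikivoyage.org"). The empty suffix mapping to Wikipedia is
# also an assumption; the real notation table is not reproduced here.
PROJECT_SUFFIXES = {"": "wikipedia.org", "v": "wikivoyage.org"}
# Assumed markers for the non-desktop site versions.
MOBILE_MARKERS = {"m", "zero"}

HEADER = ["full_project_url", "encoded_title",
          "desktop_pageviews", "mobile_and_zero_pageviews"]


def parse_project(notation):
    """Return (full_project_url, is_mobile) for a legacy project code."""
    lang, *rest = notation.split(".")
    is_mobile = any(part in MOBILE_MARKERS for part in rest)
    suffixes = [part for part in rest if part not in MOBILE_MARKERS]
    key = suffixes[0] if suffixes else ""
    # Raises KeyError for suffixes missing from the illustrative table.
    return f"{lang}.{PROJECT_SUFFIXES[key]}", is_mobile


def convert(legacy_lines):
    """Merge space-separated legacy rows into tab-separated new-format text."""
    # (project, title) -> [desktop views, mobile-and-zero views]
    counts = defaultdict(lambda: [0, 0])
    for line in legacy_lines:
        fields = line.strip().split(" ")
        if len(fields) != 4 or not fields[2].isdigit():
            continue  # skip rows that do not have the four expected fields
        notation, title, views, _bytes = fields  # byte count is dropped
        try:
            project, is_mobile = parse_project(notation)
        except KeyError:
            continue  # unknown project suffix: skipped in this sketch
        counts[(project, title)][1 if is_mobile else 0] += int(views)

    rows = ["\t".join(HEADER)]
    for (project, title), (desktop, mobile) in sorted(counts.items()):
        rows.append("\t".join([project, title, str(desktop), str(mobile)]))
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    sample = ["de.m.v Florence 32 9024", "de.v Florence 920 7570"]
    print(convert(sample), end="")
```

Run on the two made-up Florence rows, this prints the header row followed by a single tab-separated line: de.wikivoyage.org, Florence, 920, 32.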
On 19 March 2015 at 20:37, aaron shaw <aarons...@northwestern.edu> wrote:
> Adding to Giovanni's points (all of which I agree with 100%):
>
> - This would be awesome! The pageviews are super useful for many of us and
> cleaning them up a bit would save a lot of redundant work for many of us
> down the road.
> - If you don't have to collapse page views incoming from mobile and zero, I
> would recommend keeping them separate. That said, I haven't spent any time
> looking into it, and so I confess complete ignorance on this front.
> - I agree with you that page ids are better than titles. Great idea.
> - I don't think the byte information is/was useful in this dataset, so I
> agree with dumping that.
> - Backfill would be totally great.
>
> Happy to chat more if it seems helpful...
>
> a
>
>
>
> On Thu, Mar 19, 2015 at 7:13 PM, Giovanni Luca Ciampaglia
> <gciam...@indiana.edu> wrote:
>>
>> Hi Oliver,
>>
>> Tab-separation would be welcomed. Title normalisation would be *very*
>> useful too. Another thing that could potentially save a lot of space would
>> be to throw out all malformed requests, pieces of javascript, and similar
>> junk. Not sure how difficult that would be though, without doing an actual
>> query on the DB for the page id.
>>
>> For example, an excerpt from 20140101-000000.gz (with only the title and
>> views fields):
>>
>> 'اÙ�ØاÙ�Â_Ù�شباب'_Â_Ù�Ù�اطعÂ_Ù�ضØÙ�ة 1
>>
>> '/javascript:document.location.href='/'_encodeURIComponent(document.getElementById('txt_input_text').value) 9
>> '03_Bonnie_&_Clyde 18
>> A_Night_at_the_Opera_(Queen_album) 57
>> '40s_on_4 2
>> '50s_on_5 1
>> '71_(film) 4
>> '74_Jailbreak 3
>> '77 1
>>
>> '79-00_é�å�¶å�©åºÃ¯Â¿Â½å_±é��vol.8_ACå�¬å�±åºÃ¯Â¿Â½å��æ©Ã¯Â¿Â½æ§Ã¯Â¿Â½ 1
>>
>> Cheers,
>>
>> G
>>
>>
>>
>> Giovanni Luca Ciampaglia
>>
>> ✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA
>> ☞ http://www.glciampaglia.com/
>> ✆ +1 812 855-7261
>> ✉ gciam...@indiana.edu
>>
>> 2015-03-13 12:06 GMT-07:00 Oliver Keyes <oke...@wikimedia.org>:
>>
>>> So, we've got a new pageviews definition; it's nicely integrated and
>>> spitting out TRUE/FALSE values on each row with the best of em. But
>>> what does that mean for third-party researchers?
>>>
>>> Well...not much, at the moment, because the data isn't being released
>>> somewhere. But one resource we do have that third-parties use a heck
>>> of a lot, is the per-page pageviews dumps on dumps.wikimedia.org.
>>>
>>> Due to historical size constraints and decision-making (and by
>>> historical I mean: last decade) these have a number of weirdnesses in
>>> formatting terms; project identification is done using a notation
>>> style not really used anywhere else, mobile/zero/desktop appear on
>>> different lines, and the files are space-separated. I'd like to put
>>> some volunteer time into spitting out dumps in an easier-to-work-with
>>> format, using the new definition, to run in /parallel/ with the
>>> existing logs.
>>>
>>> *The new format*
>>> At the moment we have the format:
>>>
>>> project_notation - encoded_title - pageviews - bytes
>>>
>>> This puts zero and mobile requests to pageX in a different place to
>>> desktop requests, requires some reconstruction of project_notation,
>>> and contains (for some use cases) extraneous information - that being
>>> the byte-count. The files are also headerless, unquoted and
>>> space-separated, which saves space but is sometimes...I think the term
>>> is "eeeeh-inducing".
>>>
>>> What I'd like to use as a new format is:
>>>
>>> full_project_url - encoded_title - desktop_pageviews -
>>> mobile_and_zero_pageviews
>>>
>>> This file would:
>>>
>>> 1. Include a header row;
>>> 2. Be formatted as a tab-separated, rather than space-separated, file;
>>> 3. Exclude bytecounts;
>>> 4. Include desktop and mobile pageview counts on the same line;
>>> 5. Use the full project URL ("en.wikivoyage.org") instead of the
>>> pagecounts-specific notation ("en.v")
>>>
>>> So, as a made-up example, instead of:
>>>
>>> de.m.v Florence 32 9024
>>> de.v Florence 920 7570
>>>
>>> we'd end up with:
>>>
>>> de.wikivoyage.org Florence 920 32
>>>
>>> In the future we could also work to /normalise/ the title - replacing
>>> it with the page title that refers to the actual pageID. This won't
>>> impact legacy files, and is currently blocked on the Apps team, but
>>> should be viable as soon as that blocker goes away.
>>>
>>> I've written a script capable of parsing and reformatting the legacy
>>> files, so we should be able to backfill in this new format too, if
>>> that's wanted (see below).
>>>
>>> *The size constraints*
>>>
>>> There really aren't any. Like I said, the historical rationale for a
>>> lot of these decisions seems to have been keeping the files small.
>>> But by putting requests to the same title from different site versions
>>> on the same line, and dropping byte-count, we save enough space that the
>>> resulting files are approximately the same size as the old ones - or
>>> in many cases, actually smaller.
>>>
>>> *What I'm asking for*
>>>
>>> Feedback! What do people think of the new format? What would they like
>>> to see that they don't? What don't they need, here? How useful would
>>> normalisation be? How useful would backfilling be?
>>>
>>> *What I'm not asking for*
>>> WMF time! Like I said, this is a spare-time project; I've also got
>>> volunteers for Code Review and checking, too (Yuvi and Otto).
>>>
>>> The replacement of the old files! Too many people depend on that
>>> format and that definition, and I don't want to make them sad.
>>>
>>> Thoughts?
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>>
>>> _______________________________________________
>>> Wiki-research-l mailing list
>>> Wiki-research-l@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>>
>
>
> _______________________________________________
> Wiki-research-l mailing list
> Wiki-research-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>

--
Oliver Keyes
Research Analyst
Wikimedia Foundation

_______________________________________________
Wiki-research-l mailing list
Wiki-research-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wiki-research-l