Re: [Wiki-research-l] [Technical][Request for Comment] A new format for the pageview dumps

Oliver Keyes Thu, 19 Mar 2015 19:07:18 -0700

Thanks all for the awesome comments :). Will get to them tomorrow morning![1]

[1] East coast time.


On 19 March 2015 at 20:37, aaron shaw <aarons...@northwestern.edu> wrote:
> Adding to Giovanni's points (all of which I agree with 100%):
>
> - This would be awesome! The pageviews are super useful for many of us, and
> cleaning them up a bit would save a lot of redundant work for many of us
> down the road.
> - If you don't have to collapse page views incoming from mobile and zero, I
> would recommend keeping them separate. That said, I haven't spent any time
> looking into it, and so I confess complete ignorance on this front.
> - I agree with you that page ids are better than titles. Great idea.
> - I don't think the byte information is/was useful in this dataset, so I
> agree with dumping that.
> - Backfill would be totally great.
>
> Happy to chat more if it seems helpful...
>
> a
>
>
>
> On Thu, Mar 19, 2015 at 7:13 PM, Giovanni Luca Ciampaglia
> <gciam...@indiana.edu> wrote:
>>
>> Hi Oliver,
>>
>> Tab-separation would be welcomed. Title normalisation would be *very*
>> useful too. Another thing that could potentially save a lot of space would
>> be to throw out all malformed requests, pieces of javascript, and similar
>> junk. Not sure how difficult that would be though, without doing an actual
>> query on the DB for the page id.
>>
>> For example, an excerpt from 20140101-000000.gz (with only the title and
>> views fields):
>>
>> 'Ø§Ùï¿½ØØ§Ùï¿½Â_Ùï¿½Ø´Ø¨Ø§Ø¨'_Â_Ùï¿½Ùï¿½Ø§Ø·Ø¹Â_Ùï¿½Ø¶ØÙï¿½Ø© 1
>>
>> '/javascript:document.location.href='/'_encodeURIComponent(document.getElementById('txt_input_text').value)
>> 9
>> '03_Bonnie_&_Clyde 18
>> A_Night_at_the_Opera_(Queen_album) 57
>> '40s_on_4 2
>> '50s_on_5 1
>> '71_(film) 4
>> '74_Jailbreak 3
>> '77 1
>>
>> '79-00_éÃ¯Â¿Â½åÃ¯Â¿Â½¶åÃ¯Â¿Â½©åºÃ¯Â¿Â½å_±éÃ¯Â¿Â½Ã¯Â¿Â½vol.8_ACåÃ¯Â¿Â½¬åÃ¯Â¿Â½±åºÃ¯Â¿Â½åÃ¯Â¿Â½Ã¯Â¿Â½æ©Ã¯Â¿Â½æ§Ã¯Â¿Â½
>> 1
>>
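One way to approximate the junk filtering suggested above without a database lookup is a purely syntactic check on the title field. The following is a minimal sketch, not an existing tool: the function name `looks_malformed` and the marker list are hypothetical, and it assumes titles arrive percent-encoded, as in the raw dump files. It will miss mojibake that is itself valid UTF-8.

```python
import unicodedata
from urllib.parse import unquote_to_bytes

# Hypothetical markers of script fragments; tune against real data.
SUSPICIOUS_SUBSTRINGS = ("javascript:", "document.location", "encodeuricomponent(")


def looks_malformed(encoded_title):
    """Return True if the title is probably junk rather than a page title."""
    lowered = encoded_title.lower()
    if any(marker in lowered for marker in SUSPICIOUS_SUBSTRINGS):
        return True
    try:
        # Percent-decode, then require strictly valid UTF-8.
        decoded = unquote_to_bytes(encoded_title).decode("utf-8")
    except UnicodeDecodeError:
        return True
    # U+FFFD is what a lossy decoder leaves behind; control characters
    # are assumed never to occur in a legitimate title.
    return any(ch == "\ufffd" or unicodedata.category(ch) == "Cc"
               for ch in decoded)
```

A check like this only catches the obvious cases; confirming that a title maps to a real page would still need the page id lookup mentioned above.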
>> Cheers,
>>
>> G
>>
>>
>>
>> Giovanni Luca Ciampaglia
>>
>> ✎ 919 E 10th ∙ Bloomington 47408 IN ∙ USA
>> ☞ http://www.glciampaglia.com/
>> ✆ +1 812 855-7261
>> ✉ gciam...@indiana.edu
>>
>> 2015-03-13 12:06 GMT-07:00 Oliver Keyes <oke...@wikimedia.org>:
>>
>>> So, we've got a new pageviews definition; it's nicely integrated and
>>> spitting out TRUE/FALSE values on each row with the best of em. But
>>> what does that mean for third-party researchers?
>>>
>>> Well...not much, at the moment, because the data isn't being released
>>> anywhere. But one resource we do have that third parties use a heck
>>> of a lot is the per-page pageviews dumps on dumps.wikimedia.org.
>>>
>>> Due to historical size constraints and decision-making (and by
>>> historical I mean: last decade) these have a number of weirdnesses in
>>> formatting terms; project identification is done using a notation
>>> style not really used anywhere else, mobile/zero/desktop appear on
>>> different lines, and the files are space-separated. I'd like to put
>>> some volunteer time into spitting out dumps in an easier-to-work-with
>>> format, using the new definition, to run in /parallel/ with the
>>> existing logs.
>>>
>>> *The new format*
>>> At the moment we have the format:
>>>
>>> project_notation - encoded_title - pageviews - bytes
>>>
>>> This puts zero and mobile requests to pageX in a different place to
>>> desktop requests, requires some reconstruction of project_notation,
>>> and contains (for some use cases) extraneous information - that being
>>> the byte-count. The files are also headerless, unquoted and
>>> space-separated, which saves space but is sometimes...I think the term
>>> is "eeeeh-inducing".
>>>
>>> What I'd like to use as a new format is:
>>>
>>> full_project_url - encoded_title - desktop_pageviews -
>>> mobile_and_zero_pageviews
>>>
>>> This file would:
>>>
>>> 1. Include a header row;
>>> 2. Be formatted as a tab-separated, rather than space-separated, file;
>>> 3. Exclude bytecounts;
>>> 4. Include desktop and mobile pageview counts on the same line;
>>> 5. Use the full project URL ("en.wikivoyage.org") instead of the
>>> pagecounts-specific notation ("en.v")
>>>
>>> So, as a made-up example, instead of:
>>>
>>> de.m.v Florence 32 9024
>>> de.v Florence 920 7570
>>>
>>> we'd end up with:
>>>
>>> de.wikivoyage.org Florence 920 32
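The conversion in the made-up example above can be sketched roughly as follows. This is not the script mentioned in the email. The function names are hypothetical, the suffix table contains only the one mapping used in the example ("v" to "wikivoyage.org") and should not be read as the real pagecounts notation, and treating an "m" or "zero" component as the mobile/zero marker is an assumption drawn from the example.

```python
import csv
import io
from collections import defaultdict

# Placeholder table following the made-up example ("de.v" -> "de.wikivoyage.org").
PROJECT_SUFFIXES = {"v": "wikivoyage.org"}
MOBILE_MARKERS = ("m", "zero")  # assumption: these components mark mobile/zero

HEADER = ["full_project_url", "encoded_title",
          "desktop_pageviews", "mobile_and_zero_pageviews"]


def parse_project(notation):
    """Split e.g. 'de.m.v' into ('de.wikivoyage.org', True)."""
    language, *rest = notation.split(".")
    is_mobile = any(part in MOBILE_MARKERS for part in rest)
    # Exactly one project code is expected; anything else raises ValueError.
    (code,) = [part for part in rest if part not in MOBILE_MARKERS]
    return f"{language}.{PROJECT_SUFFIXES[code]}", is_mobile


def convert(legacy_lines):
    """Merge legacy rows into (project_url, title, desktop, mobile_and_zero)."""
    counts = defaultdict(lambda: [0, 0])  # [desktop, mobile_and_zero]
    for line in legacy_lines:
        # Legacy rows are space-separated; the byte count is discarded.
        notation, title, views, _byte_count = line.split()
        project, is_mobile = parse_project(notation)
        counts[(project, title)][1 if is_mobile else 0] += int(views)
    return [(p, t, d, m) for (p, t), (d, m) in sorted(counts.items())]


def to_tsv(rows):
    """Render rows as tab-separated text with a header row."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)
    return out.getvalue()
```

Calling `to_tsv(convert(["de.m.v Florence 32 9024", "de.v Florence 920 7570"]))` yields the header row followed by the single merged `de.wikivoyage.org` row for Florence, with 920 desktop and 32 mobile/zero views.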
>>>
>>> In the future we could also work to /normalise/ the title - replacing
>>> it with the page title that refers to the actual pageID. This won't
>>> impact legacy files, and is currently blocked on the Apps team, but
>>> should be viable as soon as that blocker goes away.
>>>
>>> I've written a script capable of parsing and reformatting the legacy
>>> files, so we should be able to backfill in this new format too, if
>>> that's wanted (see below).
>>>
>>> *The size constraints*
>>>
>>> There really aren't any. Like I said, the historical rationale for a
>>> lot of these decisions seems to have been keeping the files small. But
>>> by putting requests to the same title from different site versions on
>>> the same line, and dropping byte-count, we save enough space that the
>>> resulting files are approximately the same size as the old ones - or
>>> in many cases, actually smaller.
>>>
>>> *What I'm asking for*
>>>
>>> Feedback! What do people think of the new format? What would they like
>>> to see that they don't? What don't they need, here? How useful would
>>> normalisation be? How useful would backfilling be?
>>>
>>> *What I'm not asking for*
>>> WMF time! Like I said, this is a spare-time project; I've also got
>>> volunteers for Code Review and checking, too (Yuvi and Otto).
>>>
>>> The replacement of the old files! Too many people depend on that
>>> format and that definition, and I don't want to make them sad.
>>>
>>> Thoughts?
>>>
>>> --
>>> Oliver Keyes
>>> Research Analyst
>>> Wikimedia Foundation
>>>
>>> _______________________________________________
>>> Wiki-research-l mailing list
>>> Wiki-research-l@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>>
>>
>
>



-- 
Oliver Keyes
Research Analyst
Wikimedia Foundation

