[Analytics] Re: Aggregate data on edits by country and language

2023-05-18 Thread Dan Andreescu
Hi Kiril, We have editors by country here: https://dumps.wikimedia.org/other/geoeditors/readme.html and visualized here: https://stats.wikimedia.org/#/en.wikipedia.org/contributing/active-editors-by-country And we do have edits by country and language but we don't publish it except to the Global

[Analytics] Re: energy used to store

2023-02-02 Thread Dan Andreescu
Yep, so that's the best data I know of as well. The table that backs the public API is documented here. And we have a visualization of this in Wikistats, where you

[Analytics] Re: best programme to work with data

2023-01-26 Thread Dan Andreescu
visidata is *amazing* for any vim users, I want to take this opportunity to ask if other folks use it... On Thu, Jan 26, 2023 at 2:15 PM Robert Garrigos wrote: > Thanks, Federico, this is very interesting. I'll take a look. > > > Robert Garrigós i Castro >

[Analytics] Re: Conceptual differences between «Pageview» «Pageview complete» dumps

2023-01-17 Thread Dan Andreescu
> > So, the goal is to keep an historic archive of pageview activity, right? > Correct And I can imagine the «pageview» has been designed to be suitable for data > processing (API's, etc). Isn't it? > Yes, more for automated parsing/compute than for human readability. > And if you're looking to

[Analytics] Re: About «The date(s) you used are valid, but we either do not have data for those date(s)»

2023-01-17 Thread Dan Andreescu
> > If there is no visit for 2022121700 I would have expected a correct > response with value=0. > > Is this the expected behavior or I have found a glitch? I found a few > other cases, so I prefer to ask here. > You're right that this is strange, the behavior *should* be consistent, but it is
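
Until the behavior is consistent, a client can fill the gaps itself. This is a minimal sketch, assuming the response items carry a `timestamp` in `YYYYMMDDHH` form and a `views` count, as in the value quoted above; `fill_missing_days` is a hypothetical helper, not part of any Wikimedia library. Note that a missing day can also mean the data was never loaded (for example after an outage), so treating it as zero is a choice the caller has to make.

```python
from datetime import datetime, timedelta


def fill_missing_days(items, start, end):
    """Return a dense {timestamp: views} mapping from start to end, inclusive.

    items: list of dicts with "timestamp" (YYYYMMDDHH) and "views" keys
           (assumed shape of the per-article daily response).
    start, end: timestamps in the same YYYYMMDDHH form.
    Days absent from items are reported with 0 views.
    """
    seen = {item["timestamp"]: item["views"] for item in items}
    day = datetime.strptime(start, "%Y%m%d%H")
    last = datetime.strptime(end, "%Y%m%d%H")
    dense = {}
    while day <= last:
        key = day.strftime("%Y%m%d%H")
        dense[key] = seen.get(key, 0)  # assumption: absent means zero
        day += timedelta(days=1)
    return dense
```

Called with items for 2022121600 and 2022121800 only, the result also contains a 2022121700 entry with 0 views.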

[Analytics] Re: Conceptual differences between «Pageview» «Pageview complete» dumps

2023-01-17 Thread Dan Andreescu
Hi Ismael, You're right to be confused, we left the work and documentation in a messy state following the departure of team members that worked on this dataset. We have not yet been able to prioritize cleaning it up. The basic idea was that pageviews_complete was going to be a combined dataset,

[Analytics] Re: Missing Pageviews data for Jan 8, 2023

2023-01-09 Thread Dan Andreescu
Hi. There was an outage and about 8 hours of jobs failed. We're working on rerunning it. This will likely be kind of slow because they're big jobs and long pipelines. I'm not sure exactly how long to recover this many lost hours, since I've never seen it before, but we'll update this thread

[Analytics] Re: Pageviews per country

2022-12-21 Thread Dan Andreescu
> > The only way is to help with the ongoing (and complex) differential >> privacy work >> > > I have systems background but probably this could be outside my skills. > How could I help? > Hm, it's some tricky programming work, I'm not 100% sure of the

[Analytics] Re: Pageviews per country

2022-12-20 Thread Dan Andreescu
Hi Ismael, responses inline: On Tue, Dec 20, 2022 at 1:05 PM Ismael Olea wrote: > I'm completely new to analytics in Wikimedia. > Welcome! :) We are working with a heritage institution in a GLAM project and they are > interested in access statistics for the resources they have released in >

[Analytics] Re: Mediacounts fields

2022-11-04 Thread Dan Andreescu
ome poking around to see if there's a size in bytes that would be a good threshold, or a standard transcoding that is most used on articles, or anything that would allow us to filter to only the kinds of images you're interested in? If we find that, my thought is we can just update the data behind the

[Analytics] Re: Mediacounts fields

2022-11-04 Thread Dan Andreescu
ticle. > > > > Focusing only on media viewer clicks seemed like a possible solution for > solving those issues. If you have other suggestions, they are welcome! > > > > Best > > > > Michele > > > > *From: *Dan Andreescu > *Date: *Thursday, 3 Novemb

[Analytics] Re: Mediacounts fields

2022-11-03 Thread Dan Andreescu
We don't have any public data on media viewer interactions specifically. We used to have instrumentation on that feature but we haven't tracked it since last year. To get access to some of the old sanitized data that was retained for research purposes, you'd have to file a formal research

[Analytics] Re: Data engineering risks when migrating to Kubernetes

2022-07-05 Thread Dan Andreescu
Some dashboards rely on the Dashiki extension. As far as I know this is only enabled on meta.wikimedia.org, but it would be good to know if it'll still work going forward or if we have to work

[Analytics] Re: [Datasets] [Wikistats] Mediawiki History delayed this month

2022-06-07 Thread Dan Andreescu
This has been resolved, new data is available now. On Wed, Jun 1, 2022 at 9:44 AM Dan Andreescu wrote: > Quick message to say that the data pipeline that feeds into Wikistats > <https://stats.wikimedia.org/#/all-projects>, mediawiki history dumps > <https://dumps.w

[Analytics] [Datasets] [Wikistats] Mediawiki History delayed this month

2022-06-01 Thread Dan Andreescu
Quick message to say that the data pipeline that feeds into Wikistats, mediawiki history dumps, and many datasets internal to WMF (like edit hourly

[Analytics] Re: Earlier access to Pageviews hourly raw data files

2022-05-13 Thread Dan Andreescu
On Fri, May 13, 2022 at 11:26 AM Maxim Aparovich wrote: > Dear Sir or Madam, > Hi! Writing to you with a question about Pageviews hourly raw data files > . First of all, > let me know if I chose the right person for a question. If not,

[Analytics] Re: Wikimedia AQS Pageviews API Question

2022-05-03 Thread Dan Andreescu
I'll add that you can submit feature requests and bug reports on Phabricator, and tag our team, #Data-Engineering. On Tue, Apr 19, 2022 at 5:27 AM Joseph Allemandou wrote: > Hi Ben, > > pageview data

[Analytics] Re: WiViVi update

2021-12-23 Thread Dan Andreescu
Hi Antoine! We've been talking about that, yeah. We have some of the data coming out of our newer APIs, and we should be able to get the rest with some effort. I think just open up a Phabricator task and we'll collaborate there. The team's busy with more big picture platform work but this

[Analytics] Re: Geoeditors: includes bots?

2021-11-01 Thread Dan Andreescu
Thanks for pointing this out! We don't include data about bots, wherever it is possible to identify them. I've added a couple words about it, but feel free to edit the page to make it more obvious:

[Analytics] Re: Data Gap in API page view data on Oct 21

2021-10-25 Thread Dan Andreescu
Just a quick report back: this was a job that failed on Friday, and we're restarting it soon. The data should be there after it runs. On Sun, Oct 24, 2021 at 12:42 PM Martin Urbanec wrote: > Very good question Joshua. The underlying API indeed doesn't have the data > for Oct 21: >

[Analytics] Re: Access Wikipedia Metadata - API/Dumps/Query Replicas?

2021-09-17 Thread Dan Andreescu
> > "Thus, note that incremental downloads of these dumps may generate > inconsistent data. Consider using EventStreams for real time updates on > MediaWiki changes (API docs)." > I can see how that's confusing. I'll try to re-word it and then answer your other questions below. So this is

[Analytics] Re: Access Wikipedia Metadata - API/Dumps/Query Replicas?

2021-09-17 Thread Dan Andreescu
Hi Cristina, have you had a chance to read https://dumps.wikimedia.org/other/mediawiki_history/readme.html more closely? It sounds a lot like what you might need. We're consolidating all the confusing pageview dumps into a single one as well:

Re: [Analytics] Pageview-complete entries labeled as "-"

2021-03-13 Thread Dan Andreescu
Thank you for your email and thoughtful analysis, I just wanted to say I saw it but got buried with other work. I'll try and reply early next week. On Thu, Mar 11, 2021 at 03:50 Ogier Maitre wrote: > Hello everybody, > > We are currently working on a wikipedia visualisation tool (which is >

Re: [Analytics] Pageviews data for February 9th

2021-02-10 Thread Dan Andreescu
Otto wrote: > See also: > > > https://lists.wikimedia.org/pipermail/analytics-announce/2021-February/59.html > > https://phabricator.wikimedia.org/T273711 > https://phabricator.wikimedia.org/T274322 > > > > On Wed, Feb 10, 2021 at 2:21 PM Dan Andreescu >

Re: [Analytics] Pageviews data for February 9th

2021-02-10 Thread Dan Andreescu
Thanks for the message. We had a major cluster upgrade yesterday, and there is an issue with the job that loaded pageview data into the API. We think we have a fix but it may be a day or two until we get it all deployed and data starts catching up. I'll reply here when we know more. On Wed,

Re: [Analytics] Pageview Dumps outage?

2021-01-08 Thread Dan Andreescu
Marcus: we're also refactoring this dataset to be much more convenient (much smaller with more data and all the history). Stay tuned here or follow the progress as we update: https://dumps.wikimedia.org/other/pageview_complete/ On Fri, Jan 8, 2021 at 3:04 PM Luca Toscano wrote: > Hi Marcus, >

Re: [Analytics] EventLogging blocked by ad blockers

2020-09-22 Thread Dan Andreescu
> > >Is it reasonable to say that ad blockers should not be blocking > EventLogging (since it's just an internal logging system)? > Ad blockers prevent requests to beacons, whether they are used for internal > stats or otherwise (ad serving), so yes, it is pretty reasonable. A beacon > does not

Re: [Analytics] Computed Edit Counts vs Wikistats Edit Counts

2020-09-11 Thread Dan Andreescu
Hi Thorsten, thanks for the question. I see the shape of both of our graphs is very similar, with some slight differences in magnitude of some of the peaks. I think both your guess and Marcel's guess contribute to the small difference. And if you'd like to quantify it, you can always look at

Re: [Analytics] Wikisource pageviews by agent and method

2020-06-15 Thread Dan Andreescu
> > Alright. I just naively assumed that most bots would be classified as > "desktop", given there's practically never a good reason to crawl the > mobile domain, so I was surprised by the numbers. Either there's a lot > of "bot" activity on the mobile domain, or there's very little "user" >

Re: [Analytics] Wikisource pageviews by agent and method

2020-06-15 Thread Dan Andreescu
Nemo, would this, our next-up priority for Wikistats, help? Basically, it would let you filter on two different dimensions, so you can look at just user desktop or spider mobile, etc. On Mon, Jun 15, 2020 at 7:41 AM Francisco Dans wrote: > That's

Re: [Analytics] Clickstream: mobile vs. desktop, empty referrers

2020-06-11 Thread Dan Andreescu
> > I found that it even drove Groupon to conduct an iffy experiment: they > "deindexed" themselves from Google for about 6 hours, finding that "Up To > 60% Of “Direct” [i.e., referrer '-'] Traffic Is Actually Organic Search"... > https://searchengineland.com/60-direct-traffic-actually-seo-195415

Re: [Analytics] 43k monthly active editors on the English Wikipedia

2020-05-28 Thread Dan Andreescu
I'm not an analyst, but I was looking at this out of curiosity and noticed that anon users dropped by almost the same number that logged-in users rose: https://stats.wikimedia.org/#/en.wikipedia.org/contributing/active-editors/normal|line|1-year|editor_type~anonymous*user|monthly Maybe people

Re: [Analytics] Anyone from the Analytics team interested in mentoring a GSoC project?

2020-03-27 Thread Dan Andreescu
For the record, I'm happy to mentor Abel unless someone else would like to do it. On Fri, Mar 27, 2020 at 8:38 AM wrote: > Hello analytics folks! > > My name is Abel Serrano Juste (alias Quasipodo in Wikimedia, akronix in > Github). > > I'm considering doing GSoC with WMF this summer and I'd

Re: [Analytics] Community health metrics kit: Input needed!

2020-02-25 Thread Dan Andreescu
In my last meeting with Joe they were still collecting requirements, but that was before Joe got shifted around the org and I'm guessing the project is now done. This effort is exactly what I'm talking about in my email to product-all and tech-all yesterday (subject: The Question and Answer

Re: [Analytics] Wikipedia Early Page View Data Set Inquiry

2020-01-17 Thread Dan Andreescu
your help, time and feedback! > > Best, > > Emily > > On Thu, Jan 16, 2020 at 9:37 AM Dan Andreescu > wrote: > >> Emily, I believe the pagecount data was never collected in a structured >> way before 2007. See for example this discussion about some archive

Re: [Analytics] Wikipedia Early Page View Data Set Inquiry

2020-01-16 Thread Dan Andreescu
Emily, I believe the pagecount data was never collected in a structured way before 2007. See for example this discussion about some archive data that took some pains to uncover: https://phabricator.wikimedia.org/T232563 If edits per article would work as a proxy for attention, or in combination

Re: [Analytics] [Wiki-research-l] Active meta users v active wikimedia users

2020-01-06 Thread Dan Andreescu
Some near to mid-term changes that might be useful for everyone chiming in: * Active Editors across all wikis is a metric we're working on, and it will be part of Wikistats 2 sometime this fiscal year * The new mediawiki history dataset will let you download data for all wikis and crunch these

Re: [Analytics] Pageviews API missing data for some pages and dates?

2020-01-02 Thread Dan Andreescu
@Vipul: thanks for flagging this. We accidentally merged a change that ignored pages with a + in their title for the time period that Marcel mentioned: April 24th to June 6th. The relevant commits in our history are these: accident:

[Analytics] [Hadoop] [Kerberos] Authentication enabled

2019-12-17 Thread Dan Andreescu
Good news everyone, we enabled Kerberos! If you use the Hadoop cluster in any way, you'll need to kinit to get a Kerberos token so you can authenticate yourself or your job. Here's the guide again: https://wikitech.wikimedia.org/wiki/Analytics/Systems/Kerberos/UserGuide. If you use our APIs or

Re: [Analytics] Enable Kerberos authentication for Hadoop (please read if you use Hadoop for your daily work)

2019-12-17 Thread Dan Andreescu
This is now done. I'll send a separate announcement, but I'm quite proud of my awesome team right now, great work y'all. On Mon, Dec 16, 2019 at 10:58 AM Dan Andreescu wrote: > *For Notebook Users:* you need to kinit from the Jupyter interface, > through a terminal. I added docs here:

Re: [Analytics] Availability of hourly pagecounts files

2019-12-16 Thread Dan Andreescu
Collin, we are in the middle of a big upgrade today, changing the whole Analytics cluster to use Kerberos. So we're expecting delays on all datasets/jobs/services today. If you watch this list, there will be a message going out later to let everyone know things are back to normal. Thanks for

Re: [Analytics] Enable Kerberos authentication for Hadoop (please read if you use Hadoop for your daily work)

2019-12-16 Thread Dan Andreescu
*For Notebook Users:* you need to kinit from the Jupyter interface, through a terminal. I added docs here: https://wikitech.wikimedia.org/w/index.php?title=SWAP&type=revision&diff=1848295&oldid=1847841 (highlighting this since I guessed, incorrectly, that I could ssh, kinit, and then tunnel separately. This

[Analytics] [Data Release] Active Editors by country

2019-11-07 Thread Dan Andreescu
Today we are releasing a new dataset meant to help us understand the impact of grants and programs on editing. This data was requested several years ago, and we took a long time to bring in the privacy and security experts whose help we needed to release it. With that work done, you can download

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-15 Thread Dan Andreescu
Hi Marc, To follow up on something Nuria said that may have gotten lost: the xml dumps have this information, but you'll have to parse the content. See the example in the docs: https://meta.wikimedia.org/wiki/Data_dumps/Dump_format#Format_of_the_XML_files. If you expand the "content dump" link,

[Analytics] Pageviews and unique devices to a specific set of pages

2019-07-08 Thread Dan Andreescu
Forwarding a quick question from Peter so we can answer it publicly or take advantage of work others have done: [Can we] estimate how many visitors visit pages with equations (i.e., wikitext math tags)? When we're talking about "how many visitors" we're talking about our Unique Devices data

Re: [Analytics] WMF API update

2019-05-06 Thread Dan Andreescu
t. I would definitely be interested in the full data > set. How big is the file set and how would we obtain it? > > > > Thanks, > > Celeste > > > > > > *From:* Dan Andreescu > *Sent:* Monday, May 06, 2019 8:26 AM > *To:* A mailing list for the Analytics Team at

Re: [Analytics] WMF API update

2019-05-06 Thread Dan Andreescu
Celeste, thanks for writing to the list. Would you prefer a full dataset instead of querying the API? We are planning on releasing the data behind the API as a set of flat files, and it seems like that would be more useful for consumers like you. Let us know if that's the case, and if not we'd

Re: [Analytics] need metric definition clarification

2019-03-27 Thread Dan Andreescu
Thanks for the ping, Kaldari, I've updated that page. The graph and underlying data do NOT count deleted pages. So if 5 pages were created and 2 were deleted, the graph would show 3. The reason I was confused and the code does not show an explicit filter for deleted pages is because we exclude

Re: [Analytics] New User Analysis

2019-03-14 Thread Dan Andreescu
Hello, Questions about quarry querying are best left on the Quarry talk page here: https://www.mediawiki.org/wiki/Talk:Quarry And here's some example pseudo-code that looks at active editors over a rolling window:

Re: [Analytics] Stats of mediawiki API / Access to non-public data

2019-03-07 Thread Dan Andreescu
Hi Viviana, Great project! The first thought I had looking at your question is that you can collect all the data you're asking about. If your service is making API calls and people are clicking on Wikipedia links from your interface, you can just collect that information and process it

Re: [Analytics] Wikipedia throttling

2019-02-27 Thread Dan Andreescu
Hi John, While dumps are how you could start, you need daily updates, so dumps won't do. I would suggest a Lambda architecture using dumps and EventStreams. The streams are pushed out via a
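
To make the Lambda idea concrete: the dumps provide a batch view, and stream events are overlaid on top of it until the next dump arrives. This is a minimal sketch with several assumptions: EventStreams delivers server-sent events, each event is taken to fit on one `data:` line, and the `title` and `revision.new` fields follow my reading of the recentchange stream. `parse_sse_data` and `apply_updates` are hypothetical names.

```python
import json


def parse_sse_data(lines):
    """Yield decoded JSON payloads from the 'data:' lines of an SSE stream.

    Simplification: assumes each event's payload sits on a single line.
    """
    for line in lines:
        if line.startswith("data:"):
            payload = line[len("data:"):].strip()
            if payload:
                yield json.loads(payload)


def apply_updates(baseline, events):
    """Overlay stream events on the batch view built from dumps.

    baseline: {page title: latest revision id} taken from the last dump.
    events:   decoded stream events; only those carrying a title and a
              new revision id are applied (assumed field names).
    """
    view = dict(baseline)  # leave the batch view untouched
    for event in events:
        title = event.get("title")
        new_rev = event.get("revision", {}).get("new")
        if title is not None and new_rev is not None:
            view[title] = new_rev
    return view
```

When the next dump is processed, the baseline is rebuilt from it and only events newer than that dump need to be replayed.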

Re: [Analytics] Purging schedule for geoeditors_daily dataset

2018-11-21 Thread Dan Andreescu
That's good Neil. In general though, be careful with any public releases of this particular table, it's more sensitive than recentchanges. On Wed, Nov 21, 2018 at 2:45 PM Neil Patel Quinn wrote: > On Wed, 21 Nov 2018 at 11:39, Neil Patel Quinn > wrote: > >> (correcting my earlier error

Re: [Analytics] Purging schedule for geoeditors_daily dataset

2018-11-20 Thread Dan Andreescu
That's right, Neil, I just changed the language around a bit, thanks for updating that! On Tue, Nov 20, 2018 at 3:26 PM Neil Patel Quinn wrote: > Hey there! > > Could someone from Analytics clarify the purging schedule for > geoeditors_daily and add it on Wikitech >

Re: [Analytics] Request for pageview statistics pre 2015

2018-11-13 Thread Dan Andreescu
specifically, https://dumps.wikimedia.org/other/pagecounts-ez/ is much smaller to download, and it should be fairly simple to write a script to read the compressed format (also described there in detail) On Tue, Nov 13, 2018 at 9:52 AM Andrew Otto wrote: >

Re: [Analytics] Wiktionary word page views?

2018-10-24 Thread Dan Andreescu
One little thing most people don't know is if you click on the link next to "Page views in the past 30 days" you get a little graph On Wed, Oct 24, 2018 at 3:02 AM Timo Tijhof wrote: > Note that the page view information can also be found within the user > interface, on the "Page information"

Re: [Analytics] Statistics about republication of Wikimedia content

2018-10-17 Thread Dan Andreescu
The short answer is no. The long answer is that we would like there to be. I personally think we should evolve our concept of "PageView" to something more like "ContentView", and add properties such as publisher (google, apple, wikimedia, etc.), duration, etc. With our Modern Event Platform

Re: [Analytics] When is the new pages API updated?

2018-10-10 Thread Dan Andreescu
It should be updated soon, the jobs are all done successfully. But currently we do expect this kind of lag, I'll explain why. When we started we were sqooping at the beginning of the month and the processing takes something like 4 days total, most of it sqooping. But this put too much load on

Re: [Analytics] Question about data in pageview api

2018-09-25 Thread Dan Andreescu
The difference is very small, but you're right to point it out, I've opened a task to look into it: https://phabricator.wikimedia.org/T205457 On Wed, Sep 19, 2018 at 5:10 PM Felix J. Scholz wrote: > Hey, > > I've been looking through the documentation on the pageview api in recent > days, and

Re: [Analytics] Wikimedia Video tracking tool

2018-07-02 Thread Dan Andreescu
Oops, forgot to link the task, this is what I created: https://phabricator.wikimedia.org/T198628 On Mon, Jul 2, 2018 at 11:00 AM Dan Andreescu wrote: > Hi Agnes, > > We don't have any datasets that track video plays, yet. The closest thing > we have that's public is to count med

Re: [Analytics] Wikimedia Video tracking tool

2018-07-02 Thread Dan Andreescu
Hi Agnes, We don't have any datasets that track video plays, yet. The closest thing we have that's public is to count media views, so the number of times someone loaded the page that the video is on, and you can find that here: https://dumps.wikimedia.org/other/mediacounts/ I've created this

Re: [Analytics] Fwd: WikiMetrics and WikiData

2018-05-03 Thread Dan Andreescu
Ok, Agnes, all set, I created a cohort with those two users for Wikidata and indeed they have lots of edits over the past month. Thanks for the report! On Thu, May 3, 2018 at 10:07 AM, Dan Andreescu <dandree...@wikimedia.org> wrote: > Hi Agnes, > > Thanks for the report. This

Re: [Analytics] Fwd: WikiMetrics and WikiData

2018-05-03 Thread Dan Andreescu
Hi Agnes, Thanks for the report. This is indeed a bug, some configurations changed around on us and we didn't notice. I'm submitting a patch now and will ping here when it's fixed. On Fri, Apr 27, 2018 at 7:20 AM, Agnes Bruszik < agnes.brus...@wikimedia.org.uk> wrote: > Hello Dear Analytics

Re: [Analytics] Wikistats Data Outage issues

2018-04-23 Thread Dan Andreescu
It looks like we had a bad data load in March. So we've temporarily turned off the metrics that were impacted and we're reloading the data. We'll re-enable the metrics in the UI once the data looks good. Sorry for any inconvenience. On Mon, Apr 23, 2018 at 4:07 PM, Nuria Ruiz

Re: [Analytics] "pagecounts-ez' data not appearing

2018-04-17 Thread Dan Andreescu
Andrew - thanks for bringing this to our attention, the issue is now resolved. This was a one-time problem, we weren't able to update some scripts in time for infrastructure changes, so it shouldn't happen again. Thanks again. On Sat, Apr 7, 2018 at 6:26 AM, Andrew G. West

Re: [Analytics] Licensing for screenshots of pageviews data

2018-04-13 Thread Dan Andreescu
Just to back up what Nuria's saying, that logic was reviewed by legal and they suggested we use the language that Nuria quotes from Wikistats 2. So I would assume, though I am not a lawyer, that the same language could be used on all data-backed UIs such as the pageviews tool. On Fri, Apr 13,

Re: [Analytics] [Services] Getting more than just 1000 top articles from REST API

2018-04-02 Thread Dan Andreescu
t 8:54 AM, Leila Zia <le...@wikimedia.org> wrote: > >> >> >> On Mon, Apr 2, 2018 at 7:47 AM, Dan Andreescu <dandree...@wikimedia.org> >> wrote: >> >>> Hi Srdjan, >>> >>> The data pipeline behind the API can't handle arbitrary ski

Re: [Analytics] [Services] Getting more than just 1000 top articles from REST API

2018-04-02 Thread Dan Andreescu
Hi Srdjan, The data pipeline behind the API can't handle arbitrary skip or limit parameters, but there's a better way for the kind of question you have. We publish all the pageviews at https://dumps.wikimedia.org/other/pagecounts-ez/, look at the "Hourly page views per article" section. I would
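
Once the dump data is local, arbitrary skip and limit are easy to do yourself. This is a minimal sketch that assumes the file has already been parsed into `(title, views)` pairs; reading the pagecounts-ez format itself is not shown here, and `top_articles` is a hypothetical helper.

```python
import heapq


def top_articles(pairs, limit=1000, skip=0):
    """Return `limit` (title, views) pairs ranked by views, after skipping `skip`.

    pairs: iterable of (title, views) tuples, e.g. parsed from a dump file.
    Only the skip + limit largest entries are kept in memory while scanning.
    """
    ranked = heapq.nlargest(skip + limit, pairs, key=lambda pair: pair[1])
    return ranked[skip:]
```

For example, `top_articles(pairs, limit=1000, skip=1000)` gives the articles ranked 1001 to 2000.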

Re: [Analytics] Monitor the number of Wikipedia sites and the number of articles in each site

2018-03-30 Thread Dan Andreescu
rted looking into it.

Re: [Analytics] Monitor the number of Wikipedia sites and the number of articles in each site

2018-03-29 Thread Dan Andreescu
Forwarding this question to the public Analytics list, where it's good to have these kinds of discussions. If you're interested in this data and how it changes over time, do subscribe and watch for updates, notices of outages, etc. Ok, so on to your question. You'd like the *total # of articles

Re: [Analytics] Latency of hourly vs daily endpoints?

2018-03-26 Thread Dan Andreescu
> > (I ask this >> because today we have a lot of interest in append-only logs, like in >> Dat, Secure Scuttlebutt, and of course blockchains—systems where >> information cannot be repudiated after it's published. If Wikipedia >> rejects append-only logs and allows official history to be changed,

Re: [Analytics] Latency of hourly vs daily endpoints?

2018-03-22 Thread Dan Andreescu
really good summary of the situation, Neil, I'm bookmarking this and will re-use it when people ask :) On Thu, Mar 22, 2018 at 7:07 AM, Neil Patel Quinn wrote: > On 22 March 2018 at 13:41, Neil Patel Quinn wrote: > >> >> Both the edit data and

Re: [Analytics] Question about the "Page Views" tool

2018-03-07 Thread Dan Andreescu
Basically, if you make an API with the same spec as our pageview API, https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews, you don't have to replicate any of the other systems. It's just a REST interface, it can be implemented in any language/framework fairly quickly. So do that, then

Re: [Analytics] How to get old page views data?

2018-02-23 Thread Dan Andreescu
gmail.com >> > wrote: >> >>> Like dumps on article-day level? That would be already super awesome >>> much better than the current state. >>> >>> Best, Peter >>> >>> Am 22.02.2018 22:23 schrieb "Dan Andreescu" <dandre

Re: [Analytics] How to get old page views data?

2018-02-22 Thread Dan Andreescu
a lot of people to start researching. Daily page counts are not > that fancy but without them people are simply blocked. They cannot start > because they can't even get a basic idea about what was the general article > popularity for a given day. > > > Best Peter

Re: [Analytics] How to get old page views data?

2018-02-22 Thread Dan Andreescu
eated after September 2013, correct? > > John > > On Wed, Feb 21, 2018 at 2:26 PM, Dan Andreescu <dandree...@wikimedia.org> > wrote: > >> Hi Lars, >> >> You have a couple of options: >> >> 1. download the data in lossless compressed form

Re: [Analytics] How to get old page views data?

2018-02-21 Thread Dan Andreescu
Hi Lars, You have a couple of options: 1. download the data in lossless compressed form, https://dumps.wikimedia.org/other/pagecounts-ez/ The format is clever and doesn't lose granularity, should be a lot quicker than pagecounts-raw (this is basically what stats.grok.se did with the data as

Re: [Analytics] Pageview dumps lagging behind

2018-02-16 Thread Dan Andreescu
at 11:43 AM, Dan Andreescu <dandree...@wikimedia.org> wrote: > Hi, how are you deducing that? I show files up to 2018-02-16 14:00:00 > (UTC) which is very up to date, only a few hours ago. > > On Sun, Feb 11, 2018 at 4:57 AM, Spinner Cat <pogf...@gmail.com> wrote:

Re: [Analytics] Interesting (?) third party SEO study

2018-02-16 Thread Dan Andreescu
cool, thanks for sending, anything where we're #1 is on-topic :) I do wonder what these numbers are that they display with each site within each country, it doesn't match any metric we have, so I posted a comment on the site. On Fri, Feb 16, 2018 at 5:52 AM, Jaime Crespo

Re: [Analytics] abnormal traffic to https://en.wikivoyage.org/wiki/Zimbabwe

2018-02-13 Thread Dan Andreescu
filed a task: https://phabricator.wikimedia.org/T187244#3969063 On Tue, Feb 13, 2018 at 2:54 PM, Ryan Kaldari wrote: > Since the beginning of February, English Wikivoyage has seen its daily > pageviews double: > http://tools.wmflabs.org/siteviews/?platform=all- >

Re: [Analytics] Wikimedia pageviews API slow to update

2018-02-07 Thread Dan Andreescu
Hi Collin, Indeed usually the processing gets fresh data into the API within a few hours. However, sometimes, and especially at the beginning of a month, we have lots of jobs running in parallel and that slows things down a bit. Up to 24 hours of delay would be unusual but nothing too

Re: [Analytics] Tool to visualize which wiki pages link to which wiki pages?

2017-11-27 Thread Dan Andreescu
Hi Andre. Jaime's query is a good starting point, it would get you the data you need for one wiki. We can import the templatelinks table and then we can run it on Hadoop and get all wikis at once (we already have the other tables). But once we got that, we'd have a graph with millions of nodes

Re: [Analytics] Quick questions on the pagecount files

2017-11-17 Thread Dan Andreescu
Hi Ugur, The pagecounts-raw data is deprecated and hasn’t been updated for a few years. Have you seen the pagecounts-ez data? It is a merger of old pagecounts-raw and newer better pageviews data. You can find it here: https://dumps.wikimedia.org/other/pagecounts-ez/ As for the -1 view counts,

Re: [Analytics] Stopping mysql on db1047 (analytics slave) for maintenance

2017-10-26 Thread Dan Andreescu
(db1047 has the CNAME analytics-slave.eqiad.wmnet, in case people don't recognize the machine name) On Thu, Oct 26, 2017 at 3:48 AM, Luca Toscano wrote: > Hi everybody, > > mysql will not be available on db1047 for some time due to maintenance for >

Re: [Analytics] Heads up: mw.track client-side EventLogging mechanism "ignored" certain events

2017-10-12 Thread Dan Andreescu
Thanks for the post, this bug will definitely bias any data people got with mw.track. If the data is found to be so broken as to be useless, should we delete it up through the date the fix goes live? Asking people who use mw.track, not Sam On Thu, Oct 12, 2017 at 6:41 AM, Sam Smith

Re: [Analytics] Anybody know about stats.grok.se going down?

2017-08-13 Thread Dan Andreescu
of your existing or planned legacy metrics? Vipul On Sat, Aug 12, 2017 at 6:17 PM, Dan Andreescu <dandree...@wikimedia.org> wrote: Hi Vipul, actually that's also available via the API now! https://wikitech.wikimedia.org/wiki/Analytics/AQS/Legacy_Pagecounts It's a different path though, to highlight th

Re: [Analytics] Anybody know about stats.grok.se going down?

2017-08-12 Thread Dan Andreescu
rge raw dumps. I have built-in > integrations that get data from stats.grok.se; processing raw dumps to > generate pageview counts is possible but a lot of extra work :). > > Cheers, > > Vipul > > On Mon, Aug 7, 2017 at 4:17 AM, Dan Andreescu <dandree...@wikimedia.org>

Re: [Analytics] Wikimetrics down

2017-08-10 Thread Dan Andreescu
not, it's probably not your fault, don't worry about trying again, I'll send out an update if we figure it out. * some reports seem to fail, but others work, again until we can say with certainty we've fixed this database problem, it'll be hard to say what's happening. On Thu, Aug 10, 2017 at 4:

[Analytics] Wikimetrics down

2017-08-10 Thread Dan Andreescu
I'm going to stop the Wikimetrics website and service because it's having some serious problems working with the databases to get data. Most of the jobs it launches end up failing partially, which makes it hard for the user to know what's going on. I'll update this message once we can put it

Re: [Analytics] Anybody know about stats.grok.se going down?

2017-08-07 Thread Dan Andreescu
And if you need more of an API / raw data download, take a look at: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews (available at https://wikimedia.org/api/rest_v1/) and: https://dumps.wikimedia.org/other/pagecounts-ez/ On Mon, Aug 7, 2017 at 4:21 AM, Dan Garry
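
For readers replacing a stats.grok.se integration, here is a minimal sketch of how a per-article request URL for the Pageviews API mentioned above can be assembled. It only builds the URL and makes no network call. The helper name per_article_url and its defaults are illustrative choices, not part of the original message; the path layout follows the per-article endpoint described on the Analytics/AQS/Pageviews page.

```python
from urllib.parse import quote

# Base path of the per-article pageviews endpoint under the REST API root
# (https://wikimedia.org/api/rest_v1/) linked in the message above.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, article, start, end,
                    access="all-access", agent="user", granularity="daily"):
    """Build a per-article pageviews URL (hypothetical helper).

    project     e.g. "en.wikipedia" or "de.wikipedia"
    article     page title; spaces are converted to underscores
    start, end  dates as YYYYMMDD strings
    access      "all-access" (sum of all platforms), "desktop",
                "mobile-web" or "mobile-app"
    """
    # Titles use underscores instead of spaces and must be URL-encoded,
    # including any "/" inside the title.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([BASE, project, access, agent, title,
                     granularity, start, end])


url = per_article_url("en.wikipedia", "Albert Einstein",
                      "20170801", "20170807")
print(url)
```

Fetching the resulting URL with any HTTP client returns JSON. Leaving access at "all-access" asks for the combined count across desktop, mobile web and mobile app rather than one platform.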

[Analytics] new mediawiki_history snapshot available

2017-07-12 Thread Dan Andreescu
Today we announce a new snapshot (named *2017-06*) of the mediawiki history data [1]. It includes these awesome new fields: *event_user_revision_count*: 'Cumulative revision count per user for the current event_user_id (only available in revision-create events so far)' *page_revision_count*:

[Analytics] [DEPRECATED] datasets.wikimedia.org

2017-06-22 Thread Dan Andreescu
Hi all, *Who: *This mostly applies to people who have access to the stat1002 and stat1003 statistics machines on the production cluster, and publish datasets as static files. *What:* We are no longer using datasets.wikimedia.org to serve static datasets. We have set up a redirect, so requests

Re: [Analytics] Can't package/contribute to latest analytics/refinery/source

2017-03-31 Thread Dan Andreescu
Mikhail, I know Joseph was working with that job whose test is failing, I'll cc him personally to see if he has any idea. Does this work if you do it on the cluster? Since it's the weekend, you probably won't get a response until Monday, let me know if there's anything urgent and we can try to

Re: [Analytics] Data Lake documentation on Wikitech

2017-03-25 Thread Dan Andreescu
The Analytics cluster refers more to the whole infrastructure, including the raw streams of data, the processing of that data, and all the software that goes into configuring, monitoring, and maintaining it. From

Re: [Analytics] Top editors in a certain namespace across sites?

2017-03-23 Thread Dan Andreescu
We are working real hard to make cross-site querying easy from quarry, by pointing it to the new data we're working on. So we hope to have that out as soon as the new labs db servers have data for all projects. A quick question on this topic: how far back do you all need to go? Whole history

Re: [Analytics] New analytics data

2017-03-21 Thread Dan Andreescu
Hi Carsten, We deprecated that dataset, we now have better data, up to date. More information is available here: https://dumps.wikimedia.org/other/analytics/ and here: https://wikitech.wikimedia.org/wiki/Analytics/Data/Pageviews As you'll find by reading there, the data is available via a new

Re: [Analytics] Os stats

2017-03-16 Thread Dan Andreescu
To: Dan Andreescu Cc: A mailing list for the Analytics Team at WMF and everybody who has an interest in Wikipedia and analytics.; Tomas Popela Subject: Re: [Analytics] Os stats Been thinking a bit about this and while I do appreciate the privacy concerns I would assume that even if you set

Re: [Analytics] Os stats

2017-03-14 Thread Dan Andreescu
Christian, I wanted to make sure our code is working well so I took a look. We use UA Parser, a regex-based community-maintained user agent identifier. It correctly identified Fedora as the OS in all of the strings I found like '%Fedora%' for the hour of raw webrequests I looked at. However,

Re: [Analytics] Request for analytics data

2017-03-06 Thread Dan Andreescu
top" numbers, not "Mobile Web" and > "Mobile App" when it comes to the platform. > > Is that correct? > > Is there a way to get a sum over all three platforms? > > Thanks, Cheers, JJ > > On 06.03.2017 at 17:38, Jörg Jung wrote: > > Ok, guy

Re: [Analytics] Request for analytics data

2017-03-06 Thread Dan Andreescu
three projects for "de" and what is > the difference between them? /de/, /de.m/ and /de.zero/ > > Cheers, JJ > > On 06.03.2017 at 15:45, Dan Andreescu wrote: > > Jorg, take a look at https://dumps.wikimedia.org/other/pagecounts-ez/ > > which has compressed data

Re: [Analytics] Request for analytics data

2017-03-06 Thread Dan Andreescu
Jorg, take a look at https://dumps.wikimedia.org/other/pagecounts-ez/ which has compressed data without losing granularity. You can get monthly files here and download a lot less data. On Mon, Mar 6, 2017 at 5:40 AM, Jörg Jung wrote: > Marcel, > > thanx for ur quick
