Re: [Analytics] EventLogging blocked by ad blockers

2020-09-22 Thread Nuria Ruiz
Hello, What are the problems you see with the beacon being blocked when it comes to extracting value from data? In most instances what we look at when deriving insights are ratios. For example: "of the people that saw the red link how many clicked it". In this scenario, with an adequate sample

Re: [Analytics] Translations in wikistats

2020-08-31 Thread Nuria Ruiz
On Fri, Aug 28, 2020 at 19:45, Nuria Ruiz wrote: > >> Ruben:

[Analytics] Translations in wikistats

2020-08-28 Thread Nuria Ruiz
Ruben: Thanks for your question about translations in wikistats (http://stats.wikimedia.org). You can contribute translations to wikistats via translatewiki: https://translatewiki.net/wiki/Translating:Wikistats_2.0 I think on our end we need to do a bit better at making it obvious that this is the

Re: [Analytics] nefarious bot/automated traffic analysis

2020-06-16 Thread Nuria Ruiz
Scott: A good place to start to read about "bot spam" and its impact on the data is this one: https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection We recently released a new classification for traffic. Besides classifying traffic as "user" or "spider" we also have now

Re: [Analytics] Clickstream: mobile vs. desktop, empty referrers

2020-06-09 Thread Nuria Ruiz
Hello, See https://phabricator.wikimedia.org/T195880 for info on "none" referrers. Thanks, Nuria On Tue, Jun 9, 2020 at 6:10 AM Joseph Allemandou wrote: > Hi Robert > > From the `WHERE` clause here: > >

Re: [Analytics] "automated" marker added to pageview data

2020-05-18 Thread Nuria Ruiz
uest more details if they have a legitimate need for them. > > On Tue, 5 May 2020 at 02:40, Nuria Ruiz wrote: > >> Hello: >> >> We have added the 'automated' marker to Wikimedia's pageview data. Up to >> now pageview agents were classified as 'spider' (self reporte

[Analytics] "automated" marker added to pageview data

2020-05-04 Thread Nuria Ruiz
Hello: We have added the 'automated' marker to Wikimedia's pageview data. Up to now pageview agents were classified as 'spider' (self-reported bots like 'google bot' or 'bing bot') and 'user'. We have known for a while that some requests classified as 'user' were, in fact, coming from automated

Re: [Analytics] [Research-Internal] Kerberos ticket expiry, Jupyterhub on stat1004/1006 and new memory/cpu limits for stat/notebook hosts

2020-03-12 Thread Nuria Ruiz
Hello, >We deployed jupyterhub on stat1004 and stat1006, So that we are all clear on what this implies: it means that disk space constraints in Jupyter notebooks are no longer an issue. The stats machines have much more disk available than the notebook hosts. That being said, the answer to larger

Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-25 Thread Nuria Ruiz
Hello: Following up on this issue, we think many of Neil's issues come from the fact that a Kerberos ticket expires after 24 hours, and once it does your Spark session will not work anymore. We will be extending the expiration of tickets somewhat, to 2/3 days, but the main point to take home is that

Re: [Analytics] Announcement - Mediawiki History Dumps

2020-02-17 Thread Nuria Ruiz
Hello, We have added a footer to dumps pages with the CC-0 note. Please see: https://dumps.wikimedia.org/other/analytics/ For other changes that you think are needed please do file a phab ticket. Thanks, Nuria On Tue, Feb 11, 2020 at 2:50 PM Nuria Ruiz wrote: > Regarding Licens

Re: [Analytics] Announcement - Mediawiki History Dumps

2020-02-11 Thread Nuria Ruiz
Regarding Licensing, there is already a ticket: https://phabricator.wikimedia.org/T244685 If you take a look at the bottom of wikistats (https://stats.wikimedia.org/v2) you will see that the dedication is CC0. The data in both systems is the same but, of course, it can be made more explicit. Thanks,

Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-07 Thread Nuria Ruiz
wikitech.wikimedia.org/wiki/Analytics#Contact> so it stays clear. > > On Fri, 7 Feb 2020 at 07:48, Nuria Ruiz wrote: > >> Hello, >> >> Probably this discussion is not of wide interest to this public list, I >> suggest to move it to analytics-internal? >> >> Tha

Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-07 Thread Nuria Ruiz
Hello, Probably this discussion is not of wide interest to this public list, I suggest to move it to analytics-internal? Thanks, Nuria On Fri, Feb 7, 2020 at 6:53 AM Andrew Otto wrote: > Hm, interesting! I don't think many of us have used > SparkSession.builder.getOrCreate > repeatedly in

Re: [Analytics] Hourly projectviews by country

2020-01-13 Thread Nuria Ruiz
>Is there any way I can get an hourly time series of which countries are viewing which Wikipedias? Even a (country x project) resolution summary of average views for the 24 hours of the day would be helpful, if that data exists anywhere. The public data that exists in this regard is aggregated

Re: [Analytics] [Wiki-research-l] Active meta users v active wikimedia users

2020-01-06 Thread Nuria Ruiz
>I was looking to try and work out what percent of the active wikimedia community are participating on meta and comparing to another wiki farm. Any thoughts on that? I think it will help to give a bit of an example of why you are looking to find this information and why it is important. Participating

Re: [Analytics] Pageviews anomaly

2019-12-22 Thread Nuria Ruiz
Hello, This spike is probably caused by bot traffic. I would disregard it entirely. Please see, for example, a similar problem in all top pageviews in Hungarian Wikipedia for last month: https://phabricator.wikimedia.org/T237282 Thanks, Nuria On Sun, Dec 22, 2019 at 2:42 PM Brian Keegan

Re: [Analytics] Availability of hourly pagecounts files

2019-12-16 Thread Nuria Ruiz
> thought that the hourly files were the source of data for the tool. Is there any estimate of when the missing files will be available? The source of data for the tool is the pageview API: https://wikitech.wikimedia.org/wiki/Analytics/AQS/Pageviews#Pageview_counts_by_article Thanks, Nuria On

[Analytics] Releasing a dataset for caching research and tunning

2019-12-05 Thread Nuria Ruiz
Hello, The Analytics team would like to announce the release of a new dataset for caching research and tuning. Please take a look; these datasets are used by the research community for evaluations of caching algorithms. https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Caching

Re: [Analytics] Statistics

2019-08-27 Thread Nuria Ruiz
Emin: You can see identified bot traffic versus user traffic in this graph: https://stats.wikimedia.org/v2/#/az.wikipedia.org/reading/total-page-views/normal|bar|2-year|agent~user*spider|monthly , sometimes bot traffic is about 30% of the traffic. As the prior reply said we know some of the user

Re: [Analytics] [Wiki-research-l] Analytics clients (stat/notebook hosts) and backups of home directories

2019-07-10 Thread Nuria Ruiz
>I have one question for you: As you allow/encourage for more copies of >the files to exist To be extra clear, we do not encourage keeping data on the notebook hosts at all; they have no capacity to either process or host large amounts of data. Data that you are working with is best

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-09 Thread Nuria Ruiz
I'll let you know when I have more info. > > Thanks again. > Best, > > Marc Miquel > > > Message from Nuria Ruiz on Tue, 9 Jul 2019 at 1:44: > >> >Will there be a release for these two tables? >> No, sorry, there will not be. The data

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Nuria Ruiz
parameters for the entire > table or for specific parts (using batches). > > Will there be a release for these two tables? Could I connect to the > Hadoop to see if the queries on pagelinks and categorylinks run faster? > > If there is any other alternative we'd be happy to try as we cann

Re: [Analytics] project Cultural Diversity Observatory / accessing analytics hadoop databases

2019-07-08 Thread Nuria Ruiz
Hello, From your description it seems that your problem is not one of computation (well, your main problem) but rather of data extraction. The labs replicas are not meant for big data extraction jobs, as you have just found out. Neither is Hadoop. Now, our team will soon be releasing a dataset of edit

Re: [Analytics] Superset 0.32 upgrade coming tomorrow (May 15th, early EU morning)

2019-05-15 Thread Nuria Ruiz
Hello, Superset has now been upgraded. There are notable fixes in this version and now you can go crazy creating histograms because they actually work. An example: histogram of response sizes as reported by varnish last week: https://bit.ly/2vYB966 Also, there is a new dataset available called

Re: [Analytics] [ISSUE] dumps.wikimedia.org stop working

2019-04-04 Thread Nuria Ruiz
Hello, This issue should be corrected by now. Please check. Thanks, Nuria On Wed, Apr 3, 2019 at 9:18 AM Nuria Ruiz wrote: > > Sorry this has broken, Erik Z. retired recently and we are moving some of > the work he did to run somewhat differently. You can follow this issue:

Re: [Analytics] [ISSUE] dumps.wikimedia.org stop working

2019-04-03 Thread Nuria Ruiz
Sorry this has broken, Erik Z. retired recently and we are moving some of the work he did to run somewhat differently. You can follow this issue: https://phabricator.wikimedia.org/T220012 On Wed, Apr 3, 2019 at 6:36 AM Mauro Mascia wrote: > Hi, > > it seems that the daily dumps of pagecounts,

Re: [Analytics] Trouble getting yesterday's pageviews data

2019-04-02 Thread Nuria Ruiz
Outage docs now available: https://wikitech.wikimedia.org/wiki/Incident_documentation/20190402-0401KafkaJumbo On Tue, Apr 2, 2019 at 6:15 AM Luca Toscano wrote: > Hi Collin, > > you have anticipated my email :) We are tracking the issue in > https://phabricator.wikimedia.org/T219842, we had a

[Analytics] Easier mapping from Wikistats1 to Wikistats2 metrics

2019-03-28 Thread Nuria Ruiz
Hello! The Analytics team would like to announce a couple of changes. We are working towards an easier way to navigate metrics that appear in both Wikistats1 and Wikistats2 and compare numbers. Please take a look at the changes deployed today, for example for Italian Wikipedia:

Re: [Analytics] Availability of data on Wikipedia Zero rollout

2019-03-25 Thread Nuria Ruiz
Sneha, Some of the data that would be key to estimate the "increase of participation" you mention has either never been collected ("Whether those edits were being made using a device that accessed WP through WP Zero") or it was only retained short term, 90 days (" The kind of device being used

Re: [Analytics] R: Analytics Digest, Vol 85, Issue 3

2019-03-11 Thread Nuria Ruiz
ore specific > than "Re: Contents of Analytics digest..." > > > Today's Topics: > >1. R: Analytics Digest, Vol 85, Issue 2 (viviana paga) >2. Re: R: Analytics Digest, Vol 85, Issue 2 (Nuria Ruiz) > > > -

Re: [Analytics] R: Analytics Digest, Vol 85, Issue 2

2019-03-08 Thread Nuria Ruiz
>I thought having some stats by api-user-agent from backend could help me to understand these points and improve in the future my project in the best way. What do you think? Is there a procedure that can I follow to have these stats? The stats would be the same, Viviana, raw counts of call from

Re: [Analytics] Further Development of Wikipedia statistics

2019-02-07 Thread Nuria Ruiz
Hello, Several things come to mind: Top views provides much of this info digested in a way that would not be hard to calculate what you want, gets data from pageviewAPI and does some useful filtering: https://tools.wmflabs.org/topviews/?project=de.wikipedia.org=all-access=last-month= You

Re: [Analytics] [Research-Internal] Article about ML in production woes

2019-02-07 Thread Nuria Ruiz
Team, Since everyone is here, we will be working on a machine learning infrastructure program this year. I will set up meetings with everyone on this thread and some others in SRE and Audiences to get a "bag of requests" of things that are missing, first round of talks that I hope to finish next

Re: [Analytics] Does prefetch count as a pageview?

2018-12-20 Thread Nuria Ruiz
here is a native browser feature that, when searching through the address >>>> bar (Google powered) by default silently starts loading the url of the top >>>> result shown below the address bar. Maybe there's a way we opted out, but I >>>> think it applies

Re: [Analytics] Does prefetch count as a pageview?

2018-12-19 Thread Nuria Ruiz
> I think that's for the Page Previews feature (i.e., when a user hovers over a link on desktop Wikipedia) or its corresponding feature in the Wikipedia for Android (triggered by default on link tap) The code that Fran pointed to only discounts "previews" by the Android app, as we established that

Re: [Analytics] Superset going down for a few hours

2018-12-13 Thread Nuria Ruiz
Superset is back up (should have said: "going down for a few minutes"). We have rolled back the upgrade in progress. Thanks, Nuria On Thu, Dec 13, 2018 at 1:00 PM Nuria Ruiz wrote: > Team: > > Superset will be going down for a few hours today as we rollback the > updat

[Analytics] Superset going down for a few hours

2018-12-13 Thread Nuria Ruiz
Team: Superset will be going down for a few hours today as we roll back the update we were trying to do. It turns out that the newest versions of Superset are VERY non backwards compatible: they use Python 3.6, which is not available on our Debian distro, and they introduce a bunch of other bugs.

[Analytics] Wikistats2 - Metrics available for project families

2018-12-12 Thread Nuria Ruiz
Hello! The Analytics team would like to announce that we now have metrics available in Wikistats2 for what we are calling (for lack of a better name) "project families". That is, "all wikipedias", "all wikibooks", etc. See, for example, bytes added by users to all wikibooks in the last month:

Re: [Analytics] EventLogging Hive Refine currently stalled for some Schemas

2018-11-19 Thread Nuria Ruiz
ice, once per hour and once daily looking 4 days back. Data >> should appear once daily job runs for the "holes" missing. > > +1 The EL2Druid daily loading job will cover up the holes for the 12th > and 13th in 1 or 2 days. > > On Thu, Nov 15, 2018 at 5:03 PM Nuria Rui

Re: [Analytics] EventLogging Hive Refine currently stalled for some Schemas

2018-11-15 Thread Nuria Ruiz
Hello, Not all data sources are populated at the same time. The data on Druid is ingested twice: once per hour and once daily, looking 4 days back. Data for the missing "holes" should appear once the daily job runs. Thanks, Nuria On Thu, Nov 15, 2018 at 7:49 AM Andrew Otto wrote: > > Does "fixed"

Re: [Analytics] Pageviews by agent for May 18-21 2015

2018-11-13 Thread Nuria Ruiz
Hello, > One question we have is whether the pageviews we observe are driven by bots and spiders. We know that the > wikimedia rest api provides this information going back to July 1 2015. Please keep in mind that these are only self-identified bots; there is probably about 1-5% of bot pageview

Re: [Analytics] Wiktionary word page views?

2018-10-23 Thread Nuria Ruiz
The pageview API has that data as long as "individual words" are considered "articles". See sample query: https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/en.wiktionary/all-access/all-agents/table/daily/2017100100/2017103100 Docs:
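The sample query above follows a fixed path layout (project, access, agent, article, granularity, start, end). A minimal sketch of assembling such a URL in Python is shown here; the helper name `per_article_url` and its default arguments are illustrative assumptions, not part of the API, and the function only builds the string without sending a request.

```python
from urllib.parse import quote

# Base path taken from the sample query in the message above.
BASE = "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article"


def per_article_url(project, article, start, end,
                    access="all-access", agent="all-agents",
                    granularity="daily"):
    """Build a per-article pageview URL shaped like the sample query.

    Hypothetical helper: the name and defaults are assumptions made
    for illustration only.
    """
    # Titles use underscores for spaces; percent-encode everything else
    # so characters such as "/" cannot break the path layout.
    title = quote(article.replace(" ", "_"), safe="")
    return "/".join([BASE, project, access, agent, title,
                     granularity, start, end])


# Reproduces the sample query for the Wiktionary entry "table".
print(per_article_url("en.wiktionary", "table", "2017100100", "2017103100"))
```

Timestamps are passed as strings in the same YYYYMMDDHH form as the sample, so the caller stays in control of the date range.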

Re: [Analytics] Academic paper of Wikimedia' statistics v2?

2018-10-23 Thread Nuria Ruiz
Abel, If you are talking about http://stats.wikimedia.org/v2 the metric definition has not changed from the (now-called) "legacy wikistats 1" (http://stats.wikimedia.org). In the V2 system metrics are surfaced over a new UI and also new APIs so they are available programmatically. Some docs:

Re: [Analytics] Community health metrics kit: Input needed!

2018-10-22 Thread Nuria Ruiz
This seems like a start towards a way to message "community health" that anyone can grasp: https://meta.m.wikimedia.org/wiki/Grants:IdeaLab/Health_rating_radio_button_template_on_talk_pages On Mon, Oct 22, 2018 at 4:10 AM ABEL SERRANO JUSTE wrote: > Thank you for opening the discussion. In our

[Analytics] New reports in wikistats2: "top editors" (a.k.a most prolific contributors) and "top edited articles"

2018-10-11 Thread Nuria Ruiz
Hello, The analytics team would like to announce two new metrics available in wikistats2: 1. Top editors (a.k.a most prolific contributors) See example for Italian wikipedia: https://stats.wikimedia.org/v2/#/it.wikipedia.org/contributing/top-editors/normal|table|1-Month|~total 2. Top edited

Re: [Analytics] When is the new pages API updated?

2018-10-10 Thread Nuria Ruiz
>Wikistats 1 generates data on content pages with a delay of 10-15 days after the end of the month This is true for full snapshots (for the reasons we have discussed before and that Dan has described on this thread). You can expect data to be available on the API soon after the 10th, but it is

[Analytics] Wikistats2 Better maps and new metric: Legacy Pageviews (a.k.a Pagecounts)

2018-07-11 Thread Nuria Ruiz
Hello! Just a brief note to announce that we have two new things in Wikistats2 this quarter. We have reviewed maps and we now report more precise pageviews per country. Check, for example, pageviews for Portuguese Wikipedia around the world for last month:

Re: [Analytics] most popular articles per country

2018-07-09 Thread Nuria Ruiz
Amir: FYI that this data has a couple of caveats: 1) the "-" is pageviews for a page for which we cannot extract a title. 2) data is very much affected by bot spikes (you can mitigate that by filtering by agent_type="user" but still, a significant portion of bot traffic is not labeled as such).

[Analytics] Backfilling some eventlogging data on hadoop

2018-07-06 Thread Nuria Ruiz
Hello: An FYI that we are rerunning some of our jobs to backfill some eventlogging data on hadoop. The job should take about a day. Schemas affected are listed on ticket: https://phabricator.wikimedia.org/T198906 Thanks, Nuria ___ Analytics mailing list

Re: [Analytics] EventLogging MariaDB indexes

2018-05-27 Thread Nuria Ruiz
You can open a ticket and either our team or the DBAs might be able to do it. Best might be looking at data in hadoop where you can query big amounts of it more easily. EventLogging data can be found on the “events” db on hive. Thanks, Nuria On Fri, May 25, 2018 at 11:22 AM Gilles Dubuc

Re: [Analytics] Content of wmf.wdqs_extract

2018-05-08 Thread Nuria Ruiz
Adrian: Please note that this table might disappear soon as the research it was created for has finished. Also, we will be rolling out (hopefully) next quarter similar tables that split our large dataset into smaller ones. That work is still WIP. Thanks, Nuria On Tue, May 8, 2018 at 12:22 AM,

[Analytics] Wikistats Data Outage issues

2018-04-23 Thread Nuria Ruiz
Hello! We are investigating a recent outage with data in wikistats. We shall report more as our understanding of issues progresses. Thanks, Nuria

Re: [Analytics] How to get the traces of requests to the Wikipedia site in each web server

2018-04-18 Thread Nuria Ruiz
> Is there any download link available for the webrequest datasets? No, sorry, there is no download of webrequest data nor is it kept long term. As I mentioned before the best dataset that might fit your needs is this one: https://analytics.wikimedia.org/datasets/archive/public-

Re: [Analytics] Licensing for screenshots of pageviews data

2018-04-13 Thread Nuria Ruiz
My 2 cents: Data on pageviews endpoint is available under: https://creativecommons.org/publicdomain/zero/1.0/ (you need to expand each endpoint to see this, sorry, that UX could be better). You can add to pageview tool a note about licensing of the features it provides. For example: see the

Re: [Analytics] [Research-Internal] Spark2 upgraded to Spark 2.3.0, Spark 1 on the way out

2018-04-10 Thread Nuria Ruiz
FYI that this is happening today. Users may see slowness and paused jobs. We will send a note when upgrade is complete. Thanks, Nuria On Thu, Apr 5, 2018 at 1:22 PM, Andrew Otto wrote: > Hi all! > > I just upgraded spark2 across the cluster to Spark 2.3.0 >

Re: [Analytics] How to get the traces of requests to the Wikipedia site in each web server

2018-04-09 Thread Nuria Ruiz
Hello, I do not think our downloads or API provide a dataset like the one you are interested in. From your question I get the feeling that your assumptions on how our system works do not match reality; wikipedia might not be the best fit for your study. The closest data to what you are asking

Re: [Analytics] Monitor the number of Wikipedia sites and the number of articles in each site

2018-04-03 Thread Nuria Ruiz
Zainan: Labs is our cloud environment for volunteers, you can direct questions about that to cloud e-mail list. https://wikitech.wikimedia.org/wiki/Help:Cloud_Services_Introduction Thanks, Nuria On Mon, Apr 2, 2018 at 7:44 PM, Zainan Zhou (a.k.a Victor) wrote: > Thanks Dan,

Re: [Analytics] [Services] Getting more than just 1000 top articles from REST API

2018-04-02 Thread Nuria Ruiz
>are trying to rebuild our stale encyclopedia apps for offline usage but are space-limited and would only like to include the most likely pages that would be looked at that can fit within a size envelope >that varies with the device in question (up to 100k article limit probably) For this use case

Re: [Analytics] Migrated Reportcard with Updated Data

2018-03-11 Thread Nuria Ruiz
ighted e.g. in our monthly reports, > and IIRC that report card dashboard also included regional numbers). > > Have we preserved this data somewhere? > > On Fri, Apr 7, 2017 at 11:30 AM, Nuria Ruiz <nu...@wikimedia.org> wrote: > >> Hello! >> >> The

Re: [Analytics] Wikipedia internal search clickstream

2018-03-05 Thread Nuria Ruiz
Short answer, no, this data is not publicly available in a form that would let you compute the dataset yourself, as it is private data. Thanks, Nuria On Mon, Mar 5, 2018 at 11:31 AM, Georg Sorst wrote: > Hi all, > > sorry for this messy post - I forgot to subscribe to the list so I

Re: [Analytics] PageView

2018-03-02 Thread Nuria Ruiz
>Or is there another method you also count that is gathered for other companies that collect views? Companies that do this, such as comScore, do it by having their participants install software (normally desktop software) on their machines and tracking the page views that these participants make. It was the case

Re: [Analytics] Wikipedia internal search clickstream

2018-03-02 Thread Nuria Ruiz
>Did I miss something? Is this data available somewhere? You can find more information about clickstream datasets here: https://blog.wikimedia.org/2018/01/16/wikipedia-rabbit-hole-clickstream/ Datasets do not include simple wiki; they are calculated for a few wikis, some of which are not very

Re: [Analytics] How to get old page views data?

2018-02-22 Thread Nuria Ruiz
Peter: Do submit a Phabricator task with your request, it'll be easier to follow up on it than it is via e-mail. Our backlog: https://phabricator.wikimedia.org/tag/analytics/ I assume you know that per article views are available since 2015, a way to see those:

Re: [Analytics] Wikistats 2.0 - Now with Maps!

2018-02-22 Thread Nuria Ruiz
e that search bots and other obscure automated processes are distorting >> this data, and are there ways to filter that out in order to know where are >> the actual humans interested in a Wikimedia project? >> >> >> On Wed, Feb 14, 2018 at 11:15 PM, Nuria Ruiz <nu...@wik

Re: [Analytics] Page hourly views

2018-02-11 Thread Nuria Ruiz
Sorry, not sure we understand this question. Can you elaborate? On Sun, Feb 11, 2018 at 12:10 PM, Bo Han wrote: > Hello, > > Is the process for generating pageview hourly backed up? > > Thank you

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Nuria Ruiz
>Regarding the last few posts about the geolocation information, from the data analysis perspective, there is indeed another, more serious concern about using the GeoIP cookie: >It will create significant discrepancies with the existing geolocation data we record for pageviews, where we have

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-01 Thread Nuria Ruiz
>Wow Sam, yeah, if this cookie works for you, it will make many things much easier for us This is how it is done on performance schemas for Navigation timing data per country, so there is a precedent.

Re: [Analytics] [Product] Fwd: Session #6 and into all hands

2018-01-31 Thread Nuria Ruiz
Sorry, my last correspondence was for analytics-internal@ On Wed, Jan 31, 2018 at 8:29 AM, Nuria Ruiz <nu...@wikimedia.org> wrote: > If you have time, do skim through these docs. I will do the same between > today and tomorrow, they are pretty informative as to how annual plan i

[Analytics] Fwd: [Product] Fwd: Session #6 and into all hands

2018-01-31 Thread Nuria Ruiz
If you have time, do skim through these docs. I will do the same between today and tomorrow, they are pretty informative as to how annual plan is and what audiences is doing. -- Forwarded message -- From: Jon Katz Date: Tue, Jan 30, 2018 at 8:16 PM

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-30 Thread Nuria Ruiz
>I’m not totally sure if this works for you all, but I had pictured generating aggregates from the page preview events, and then joining the page preview aggregates with the pageview aggregates into a new table with an extra dimension specifying which type of content view was made. In my opinion

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Nuria Ruiz
>Thanks, good to know - is there a report around that? I'm wondering how "missing requests" ought to be expressed with some margin of error. I think the team that can quantify this best is yours. If anything, from what I remember of the popups experiments, the inflow of events was higher than

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Nuria Ruiz
en. On Wed, Jan 17, 2018 at 6:09 PM, Gergo Tisza <gti...@wikimedia.org> wrote: > On Wed, Jan 17, 2018 at 10:54 AM, Nuria Ruiz <nu...@wikimedia.org> wrote: > >> Recording "preview_events" is really no different than recording any >> other kind of UI eve

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Nuria Ruiz
>Thanks, Sam. Nuria, that's what I was getting at - if using the EL JS library would some sort of new method be needed so that these impressions aren't undercounted? If we had a lot of users with DNT, maybe; from our tests when we enabled that on EL, this is not the case. Your team has already run

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Nuria Ruiz
> wrote: > > On Thu, Jan 18, 2018 at 8:16 AM, Nuria Ruiz <nu...@wikimedia.org> wrote: > >> Gergo, >> >> >while EventLogging data gets stored in a different, unrelated way >> Not really, This has changed quite a bit as of the last two quarters. >>

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Nuria Ruiz
his to do whatever you like with it), and be refined into >>>> its own Hive table. >>>> >>>> > I don’t want to have to create that chart and export one dataset >>>> from pageviews and one dataset from eventlogging to do that. >>>> If you also d

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Nuria Ruiz
Gergo, >while EventLogging data gets stored in a different, unrelated way Not really, this has changed quite a bit over the last two quarters. Eventlogging data now gets preprocessed and refined similarly to how webrequest data is preprocessed and refined. You can have a dashboard on top

Re: [Analytics] How best to accurately record page interactions in Page Previews

2018-01-17 Thread Nuria Ruiz
(Moving ops list to bcc) >Are there other ways of recording this information? We're fairly confident that #1 seems like the best choice here but it's referred to as the "virtual file view hack". Is this really the case? Yes, there are, please use eventlogging. Recording "preview_events" is

Re: [Analytics] Reboot of eventlog1001 for kernel upgrades

2018-01-15 Thread Nuria Ruiz
>If you see a dip in Eventlogging schema metrics (https://grafana.wikimedia.org/dashboard/db/eventlogging-schema?orgId=1) it will be my fault :) To be super clear: the host will stop consuming for as long as it is being rebooted, and it will pick up past data once it comes back online. On Mon, Jan

Re: [Analytics] [Engineering] Important news about Analytics databases

2017-11-22 Thread Nuria Ruiz
>The log database is scheduled to be dropped from dbstore1002 on Tuesday 28th. After that, the log database will be available only on db1108 (analytics-slave.eqiad.wmnet). To make sure everyone is on the same page this means that you need to connect to analytics-slave.eqiad.wmnet if you wish to

Re: [Analytics] Undocumented project code in pagecounts-ez

2017-11-22 Thread Nuria Ruiz
Maybe this doc will help? https://wikitech.wikimedia.org/wiki/Analytics/Archive/Data/Pagecounts-all-sites#Disambiguating_abbreviations_ending_in_.E2.80.9C.m.E2.80.9D On Tue, Nov 14, 2017 at 1:29 PM, Michael Baldwin wrote: > Thanks, Federico. > > In the docs you

Re: [Analytics] research process (was Re: Google Code-in: Get your tasks for young contributors prepared!)

2017-11-17 Thread Nuria Ruiz
> Leila Zia > Senior Research Scientist > Wikimedia Foundation > > > On Tue, Nov 7, 2017 at 12:22 PM, Nuria Ruiz <nu...@wikimedia.org> wrote: > > I would say that referrer "origin-when-cross-origin" (Send a full URL > when > > performing a same-origin r

Re: [Analytics] research process (was Re: Google Code-in: Get your tasks for young contributors prepared!)

2017-11-07 Thread Nuria Ruiz
I would say that referrer "origin-when-cross-origin" (Send a full URL when performing a same-origin request, but only send the origin of the document for other cases) is probably the most widely deployed default on the internets, we use it as well as google, facebook... For wikipedia, see:
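The behaviour described in the parenthesis above (full URL for same-origin requests, only the origin otherwise) can be sketched as a small Python function. This is a simplified illustration, not browser code: the function names are hypothetical, and it assumes origins can be compared as scheme plus host string, ignoring default-port normalisation and credentials in URLs.

```python
from urllib.parse import urlsplit


def origin(url):
    """Return scheme://host[:port] for a URL (simplified origin)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


def referrer_sent(page_url, target_url):
    """Referrer value under "origin-when-cross-origin" (sketch).

    Hypothetical helper: same-origin requests get the full page URL
    (without its fragment); every other request gets only the origin.
    """
    if origin(page_url) == origin(target_url):
        # Same origin: send the full URL, dropping any #fragment.
        return page_url.split("#", 1)[0]
    # Cross origin: send only the origin, with a trailing slash.
    return origin(page_url) + "/"


# A link from one Wikipedia article to another keeps the full URL...
print(referrer_sent("https://en.wikipedia.org/wiki/Cat#History",
                    "https://en.wikipedia.org/wiki/Dog"))
# ...while an external site only learns the origin.
print(referrer_sent("https://en.wikipedia.org/wiki/Cat",
                    "https://example.org/some/page"))
```

The practical consequence for traffic analysis is that external sites can tell a visit came from a given wiki, but not from which article.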

Re: [Analytics] Heads up: mw.track client-side EventLogging mechanism "ignored" certain events

2017-10-13 Thread Nuria Ruiz
pened e.g. when I reproduced the bug > here: https://phabricator.wikimedia.org/T175918#3612580 , and is also > evident in data e.g. from the previous Popups experiments). > > On Thu, Oct 12, 2017 at 2:14 PM, Nuria Ruiz <nu...@wikimedia.org> wrote: > > Please keep in mind that hitting this b

Re: [Analytics] Heads up: mw.track client-side EventLogging mechanism "ignored" certain events

2017-10-12 Thread Nuria Ruiz
Please keep in mind that hitting this bug requires a race condition, so it is hit in a minority of cases, not every time. The essence of the bug has to do with the subscription to the "load" event. In some instances the event had already happened by the time the EL code was loaded. Thanks, Nuria On
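The race described above (a listener registered after the event has already fired never runs) can be simulated in a few lines of Python. This is a toy model only: the `Page` class and both `subscribe_*` functions are hypothetical names, it is not the EventLogging code, and the "check state before subscribing" guard is one common remedy rather than a description of the actual fix.

```python
class Page:
    """Toy stand-in for a page that fires a one-shot "load" event."""

    def __init__(self):
        self.loaded = False
        self._listeners = []

    def on_load(self, callback):
        # Registers a listener; does nothing for events already fired.
        self._listeners.append(callback)

    def fire_load(self):
        self.loaded = True
        for callback in self._listeners:
            callback()


def subscribe_naive(page, callback):
    # Misses the event whenever "load" fired before this call.
    page.on_load(callback)


def subscribe_safe(page, callback):
    # Check the current state first, then fall back to subscribing.
    if page.loaded:
        callback()
    else:
        page.on_load(callback)


page = Page()
page.fire_load()  # "load" happens before the logging code arrives
calls = []
subscribe_naive(page, lambda: calls.append("naive"))
subscribe_safe(page, lambda: calls.append("safe"))
print(calls)  # only the safe subscriber ran
```

Because the outcome depends on whether the code arrives before or after the event, the naive version fails only some of the time, which matches the "minority of cases" observation.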

[Analytics] Archiving some eventlogging tables to hadoop

2017-10-03 Thread Nuria Ruiz
Team: Our mysql backend for eventlogging is having issues due to disk space. We need to free space in the near term so we will be archiving some tables to hadoop. Please see: https://phabricator.wikimedia.org/T168303 and:

Re: [Analytics] Resources stat1005

2017-08-14 Thread Nuria Ruiz
Adrian, You already have access to use the cluster, which is where you should move your processing, the link to yarn was just to show resource consumption. Thanks, Nuria On Sat, Aug 12, 2017 at 3:52 PM, Adrian Bielefeldt < adrian.bielefe...@mailbox.tu-dresden.de> wrote: > Hi Andrew, > >

Re: [Analytics] Article creation stats

2017-08-14 Thread Nuria Ruiz
>Would there happen to be a dataset of that available somewhere? Data is available on the public labs replicas, but the SQL is complicated to write and likely to time out due to the volume of data it has to comb through. Data is also available in the Hadoop Data Lake, which is not public yet (it is our plan to make it

Re: [Analytics] fishy browser stats

2017-08-03 Thread Nuria Ruiz
nd > analysis etc." They have since increased and, as can be gleaned from > Kaldari's remarks, do indeed affect our global stats markedly now. I have > started to remove them in the pageviews stats and trends I'm preparing, > will follow up with more detail on Phabricator. > > >

Re: [Analytics] Daily merged pageviews stopped ?

2017-08-01 Thread Nuria Ruiz
Ticket here: https://phabricator.wikimedia.org/T172032 On Tue, Aug 1, 2017 at 1:22 PM, Akeron wrote: > Hello, > > Last file is one week ago : pagecounts-2017-07-23.bz2 > https://dumps.wikimedia.org/other/pagecounts-ez/merged/2017/2017-07/ > > Thanks, > > Akeron. > >

Re: [Analytics] Analytics project request

2017-07-24 Thread Nuria Ruiz
Daniel, Signing an NDA is not enough to get access to the data; you also need to be part of a formal research collaboration with our research team. They have a number of those and are not likely to accept any more soon, but you can contact them in that regard:

Re: [Analytics] fishy browser stats

2017-07-21 Thread Nuria Ruiz
>Surely this can't be accurate though as most other sites on the internet report virtually non-existent usage of IE7 (less than 1% everywhere I've checked). Can someone double-check this? This is likely bot traffic with an IE7 user-agent. See: https://phabricator.wikimedia.org/T148461 We will

[Analytics] Eventlogging incident report

2017-07-18 Thread Nuria Ruiz
Team: Please see the recent incident report for EventLogging [1], [2]. TL;DR: After the addition of some EventBus events to MySQL, we had an issue with the insertion of events in which some events were dropped. This affected all schemas. Events for all schemas have been backfilled as of now. [1]

Re: [Analytics] [Wikitech-l] Drop in mainpage pageviews?

2017-07-17 Thread Nuria Ruiz
> do not remember the exact date, but a couple of months ago the way for > pageviews counting was changed, using cookies, affecting the mobile web > views. This can be the cause. > Igal (User IKhitron) ... not sure what you are referring to, can you be more specific? We have not changed the

Re: [Analytics] new mediawiki_history snapshot available

2017-07-12 Thread Nuria Ruiz
> On Wed, Jul 12, 2017 at 12:16 PM, Nuria Ruiz <nu...@wikimedia.org> wrote: > > Further clarification that this snapshot of data is not yet public > (meaning > > available to the outside world, not just WMF/NDA holders). > > Thanks for clarifying this and the work y

Re: [Analytics] new mediawiki_history snapshot available

2017-07-12 Thread Nuria Ruiz
Further clarification that this snapshot of data is not yet public (meaning available to the outside world, not just WMF/NDA holders). Our team is working towards making this data available next year in labs in the same fashion that data is now available on the labs replicas. Thanks, Nuria On

[Analytics] Dropping MoodBar extension tables from all wikis

2017-07-07 Thread Nuria Ruiz
Hello! This is an FYI that the MoodBar extension has been undeployed and, as such, its tables will be removed from all wikis. See https://phabricator.wikimedia.org/T153033 It looks like this extension sparked some interest in the past [1] and there were some research projects about it. Please let us

Re: [Analytics] [Research-Internal] [Ops] EventStreams launch and RCStream deprecation

2017-06-27 Thread Nuria Ruiz
I think Jon got his question answered, but to keep the archives happy, here is the length for the new event: https://github.com/wikimedia/mediawiki-event-schemas/blob/master/jsonschema/mediawiki/recentchange/1.yaml#L113 On Wed, Mar 8, 2017 at 5:02 PM, Jon Robson wrote: > Hey

Re: [Analytics] Connect to wikidata.org from stat1002.eqiad.wmnet

2017-05-14 Thread Nuria Ruiz
>(i.e. implying that we need to collect the data somewhere else, and move to production for number crunching only)? I think we should probably set up a sync-up so you get an overview of how this works, because this is a brief response. Data is harvested in some production machines, it is processed

Re: [Analytics] Connect to wikidata.org from stat1002.eqiad.wmnet

2017-05-13 Thread Nuria Ruiz
Adrian, >At the moment I'm working on checking which entries equal one of the example queries at https://www.wikidata.org/wiki/Wikidata:SPARQL_query_service/queries/examples using this

Re: [Analytics] Connect to wikidata.org from stat1002.eqiad.wmnet

2017-05-12 Thread Nuria Ruiz
Adrian, Can you give us some context as to what project you are working on and what you are trying to do? Thanks, Nuria On Sat, May 13, 2017 at 1:12 AM, Adrian Bielefeldt < adrian.bielefe...@mailbox.tu-dresden.de> wrote: > Hello everyone, > > I wanted to ask how I have to proceed to be
