[Analytics] Re: API Outages

2023-02-24 Thread Andrew Otto
est as input (including pageviews computation) began to timeout while waiting for input. - We have been slowly restarting and recovering jobs now that webrequest ingestion has caught up again. I don't know exactly how long data is delayed or when it will be fully available, but I'd guess: soon / today?

[Analytics] Re: energy used to store

2023-02-02 Thread Andrew Otto
Hi Willy, (Forwarding your question to the public analytics list for others who might know more.) > Do you have any data that shows how many times audio files were downloaded in 2022? I think your best bet is the Mediacounts dataset

[Analytics] Re: Data engineering risks when migrating to Kubernetes

2022-07-05 Thread Andrew Otto
Can those be run > against test2wiki? > > Thanks, > > Dom > > Andrew Otto writes: > > Hm, I don't think the analytics systems interact too directly with > MediaWiki. They do use the EventStreamConfig extension's API from > meta.wikimedia.org. The EventLogging ex

[Analytics] Re: Data engineering risks when migrating to Kubernetes

2022-06-21 Thread Andrew Otto
Hm, I don't think the analytics systems interact too directly with MediaWiki. They do use the EventStreamConfig extension's API from meta.wikimedia.org. The EventLogging extension

[Analytics] stream.wikimedia.org - stream retention change

2022-03-21 Thread Andrew Otto
*. In the future, we would like to intentionally remove this data from streams. Doing so requires us to maintain new services that produce new streams with PII information redacted. Doing this is not a trivial thing to stand up, hence this mitigation effort for now. -Andrew Otto Wikimedia Foundation

Re: [Analytics] Fixing missing revision-create events & removing the rev_is_revert field

2021-05-05 Thread Andrew Otto
if there are objections. Thank you! -Andrew Otto SRE, Data Engineering, WMF On Mon, Apr 19, 2021 at 9:37 AM Andrew Otto wrote: > Hi all, > > tl;dr: we'd like to remove the rev_is_revert field from the > mediawiki.revision-create stream to solve a missing event problem. > > For year

[Analytics] Fixing missing revision-create events & removing the rev_is_revert field

2021-04-19 Thread Andrew Otto
bricator.wikimedia.org/T215001 Thanks! -Andrew Otto SRE, Data Engineering, WMF ___ Analytics mailing list Analytics@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/analytics

Re: [Analytics] About Varnish NTP server time accuracy

2021-03-16 Thread Andrew Otto
Am not totally sure how our NTP setup works, but it is all in the WMF's puppet repository. https://github.com/wikimedia/puppet/blob/production/modules/profile/manifests/ntp.pp looks like a good place to start. On Tue, Mar 16, 2021 at 6:04 AM Ho Chung wrote:

Re: [Analytics] About readership timestamp

2021-03-15 Thread Andrew Otto
Hi, Yes we prefer to always use UTC for timestamps. On Fri, Mar 12, 2021 at 2:40 PM Ho Chung wrote: > Hello > > In this page did you know when any readership visit any Chinese web page > , > > Eg. https://zh.wikipedia.org/wiki/MP3 > > > the timestamp is use UTC ? > > >
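Since timestamps are kept in UTC, a reader in another timezone has to convert them for local analysis. The following is a minimal sketch of that conversion, not anything taken from the thread: the fixed UTC+8 offset, the function name `utc_to_local`, and the sample timestamp are all assumptions chosen for illustration.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical fixed offset for illustration: UTC+8, no daylight saving.
LOCAL_TZ = timezone(timedelta(hours=8))


def utc_to_local(utc_timestamp: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp and convert it to LOCAL_TZ."""
    # Replace a trailing "Z" so older Python versions can parse the string.
    parsed = datetime.fromisoformat(utc_timestamp.replace("Z", "+00:00"))
    return parsed.astimezone(LOCAL_TZ)


if __name__ == "__main__":
    print(utc_to_local("2021-03-12T18:40:00Z").isoformat())
```

Storing UTC and converting only at display time avoids ambiguity when readers in several timezones look at the same data.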

Re: [Analytics] Pageviews data for February 9th

2021-02-10 Thread Andrew Otto
See also: https://lists.wikimedia.org/pipermail/analytics-announce/2021-February/59.html https://phabricator.wikimedia.org/T273711 https://phabricator.wikimedia.org/T274322 On Wed, Feb 10, 2021 at 2:21 PM Dan Andreescu wrote: > Thanks for the message. We had a major cluster upgrade

Re: [Analytics] EventLogging blocked by ad blockers

2020-09-22 Thread Andrew Otto
Event Platform uses a new url: https://intake-analytics.wikimedia.org/v1/events. We're working on migrating all legacy EventLogging over to Event Platform EventGate. When that happens the legacy data will be POSTed to the new URL. You could migrate over early if you want by creating a new

Re: [Analytics] [Research-Internal] Tutorials on disk space usage for notebook/stat boxes

2020-02-18 Thread Andrew Otto
I added a 'GPU?' column too. :) THANKS LUCA! On Tue, Feb 18, 2020 at 11:51 AM Luca Toscano wrote: > Hey Diego, > > added a section at the end of the page with the info requested, let me > know if anything is missing :) > > Luca > > Il giorno mar 18 feb 2020 alle ore 17:37 Diego Saez-Trumper <

Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-07 Thread Andrew Otto
Hm, interesting! I don't think many of us have used SparkSession.builder.getOrCreate repeatedly in the same process. What happens if you manually stop the spark session first, (session.stop()

Re: [Analytics] [Wiki-research-l] Enable Kerberos authentication for Hadoop (please read if you use Hadoop for your daily work)

2019-12-17 Thread Andrew Otto
Note! For Hive Jupyter users, there is an issue with PyHive and Kerberos, so you'll want to switch to Impyla. I've updated docs at https://wikitech.wikimedia.org/wiki/SWAP#with_Hive_(MapReduce) with instructions.

[Analytics] Spark 2.4 upgrade for Analytics Cluster - Tuesday, November 5th

2019-10-28 Thread Andrew Otto
here: https://phabricator.wikimedia.org/T53 - Andrew Otto (Systems Engineer) & Analytics Engineering Team

[Analytics] EventLogging MySQL Deprecation

2019-10-28 Thread Andrew Otto
instance and eventually repurposing the hardware. More info here and in sub tickets: https://phabricator.wikimedia.org/T159170 Thanks! - Andrew Otto (Systems Engineer) & Analytics Engineering Team

Re: [Analytics] Spark 1.x to be removed from analytics cluster; use spark2-* (and pyspark2) only

2019-02-21 Thread Andrew Otto
Hi again! This has been done. Documentation has been updated at https://wikitech.wikimedia.org/wiki/Analytics/Systems/Cluster/Spark. On Tue, Jan 22, 2019 at 10:17 AM Andrew Otto wrote: > Hi friends! > > Spark 1.x is pretty old. We only keep it around because it is a standa

[Analytics] Article about ML in production woes

2019-02-07 Thread Andrew Otto
Just came across https://www.confluent.io/blog/machine-learning-with-python-jupyter-ksql-tensorflow In it, the author discusses some of what he calls the 'impedance mismatch' between data engineers and production engineers. The links to Uber's Michelangelo

[Analytics] Spark 1.x to be removed from analytics cluster; use spark2-* (and pyspark2) only

2019-01-22 Thread Andrew Otto
!). (If this timeline doesn't work for you just let us know and we'll adjust.) Thanks! - Andrew Otto & Analytics Engineering https://phabricator.wikimedia.org/T212134

Re: [Analytics] Hive EventLogging bug caused NULL fields since 2018-11-29

2019-01-14 Thread Andrew Otto
. More info here: https://phabricator.wikimedia.org/T213602. Sorry for the inconvenience. -Andrew Otto Systems Engineer, WMF On Fri, Dec 14, 2018 at 3:48 PM Andrew Otto wrote: > Hi all, > > A bug in the code that imports EventLogging data into Hive caused top 3 > level EventCa

[Analytics] Rsync between stat* and notebook* to allow only pull

2019-01-02 Thread Andrew Otto
, please update them accordingly. Thanks! -Andrew Otto, Systems Engineer

[Analytics] Hive EventLogging bug caused NULL fields since 2018-11-29

2018-12-14 Thread Andrew Otto
(more importantly) userAgent. We've fixed the bug, and are backfilling the data now. https://phabricator.wikimedia.org/T211833 has more info. Sorry for the inconvenience! Follow the phabricator ticket to get updates on when backfilling has completed. -Andrew Otto Systems Enginee

Re: [Analytics] EventLogging Hive Refine currently stalled for some Schemas

2018-11-15 Thread Andrew Otto
> seeing gaps (zero events) in Turnilo for Druid-ingested EL data, for the >> timespans between around 6am-16pm on November 13, and 7am-10am on November >> 12. >> >> On Thu, Nov 15, 2018 at 6:51 AM Andrew Otto wrote: >> >>> OH I’m sorry! There is a Phab t

Re: [Analytics] EventLogging Hive Refine currently stalled for some Schemas

2018-11-15 Thread Andrew Otto
follow? > > On Tue, Nov 13, 2018 at 6:27 PM Andrew Otto wrote: > >> Hi all, >> >> Yesterday we upgraded the Hadoop cluster to a newer version. It seems >> that along the way the job that imports EventLogging data into Hive has >> started failing for some Ev

[Analytics] EventLogging Hive Refine currently stalled for some Schemas

2018-11-13 Thread Andrew Otto
, and we’ll keep you updated! -Andrew Otto Systems Engineer, WMF The list of currently affected tables is: AdvancedSearchRequest CentralAuth CentralNoticeBannerHistory CentralNoticeImpression CentralNoticeTiming ChangesListFilterGrouping ChangesListFilters CitationUsage CitationUsagePageLoad

Re: [Analytics] Request for pageview statistics pre 2015

2018-11-13 Thread Andrew Otto
https://dumps.wikimedia.org/other/analytics/ :) On Tue, Nov 13, 2018 at 9:41 AM William Corin David East, Mr < william.e...@mcgill.ca> wrote: > Hi, > > > > As part of my research on political behavior, I’m working on a replication > project of using pageview statistics to predict election

Re: [Analytics] Persisting some temp data in hive, so that others can access it

2018-09-24 Thread Andrew Otto
Heya! We totally support users creating their own Hive tables. You should do so in your own Hive database. If you don’t yet have one, you should be able to create one “CREATE DATABASE ;” On Mon, Sep 24, 2018 at 10:43 AM Ian Marlier wrote: > Hi there -- > > I've been doing some analysis using

[Analytics] EventLogging MySQL Schema Whitelist

2018-09-11 Thread Andrew Otto
Hi all you EventLogging users out there! tl;dr We will switch EventLogging MySQL ingestion to be based on a schema whitelist rather than blacklist. As you know, we currently import EventLogging events into two locations for analysis: The MySQL ‘log’ database, and the Hive ‘event’ database.

Re: [Analytics] stats.wikimedia.org maintenance downtime

2018-09-05 Thread Andrew Otto
This has been done, thanks all! On Tue, Aug 28, 2018 at 12:53 PM Andrew Otto wrote: > Hi all, > > On Wednesday September 5th at around 13:30 UTC we will be taking > stats.wikimedia.org and analytics.wikimedia.org offline for a server > upgrade. We expect this downtime to tak

[Analytics] stats.wikimedia.org maintenance downtime

2018-08-28 Thread Andrew Otto
. Thanks! -Andrew Otto Systems Engineer Wikimedia Foundation

Re: [Analytics] Internal analytics tools upgrades (Superset, Turnilo, Hue)

2018-08-28 Thread Andrew Otto
This is done, please let us know if you have any issues. Along the way, we upgraded Superset to 0.26.3 and Turnilo to 1.7.2. Stay tuned for the following announcement about downtime for stats.wikimedia.org and analytics.wikimedia.org. On Mon, Aug 27, 2018 at 3:07 PM Andrew Otto wrote

[Analytics] Internal analytics tools upgrades (Superset, Turnilo, Hue)

2018-08-27 Thread Andrew Otto
and turnilo either this week or next. The move should be transparent to you all (you might have to re-log in). Just in case, if you encounter any issues please report them here: https://phabricator.wikimedia.org/T202011 Thanks! - Andrew Otto Systems Engineer, WMF

[Analytics] SWAP (Jupyter Notebooks) now supports Spark

2018-08-13 Thread Andrew Otto
ounter issues or have questions, please respond on this phabricator ticket <https://phabricator.wikimedia.org/T190443>, or create a new one and add the Analytics tag. Enjoy! -Andrew Otto & Analytics Engineering

[Analytics] EventStreams goes multi-datacenter on Monday August 6

2018-08-01 Thread Andrew Otto
Source clients will reconnect automatically and begin to use timestamps instead of offsets in the Last-Event-ID. You can read more about this work here: https://phabricator.wikimedia.org/T199433 - Andrew Otto, Systems Engineer, WMF
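For readers unfamiliar with how Last-Event-ID works in Server-Sent Events, here is a minimal sketch of a client tracking the most recent event id and sending it back on reconnect. This is generic SSE handling, not EventStreams code: the function names `iter_sse_events` and `reconnect_headers` are hypothetical, and the id is treated as an opaque string because the exact format of the EventStreams value is not shown here.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def iter_sse_events(lines: Iterable[str]) -> Iterator[Tuple[Optional[str], str]]:
    """Yield (last_event_id, data) for each event in a stream of SSE lines."""
    last_event_id: Optional[str] = None
    data_parts: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line terminates the current event.
            if data_parts:
                yield last_event_id, "\n".join(data_parts)
            data_parts = []
            continue
        if line.startswith(":"):
            continue  # comment or keep-alive line
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "id":
            last_event_id = value  # persists until a later event replaces it
        elif field == "data":
            data_parts.append(value)


def reconnect_headers(last_event_id: Optional[str]) -> Dict[str, str]:
    """Build the request headers a client would send when reconnecting."""
    if last_event_id is None:
        return {}
    return {"Last-Event-ID": last_event_id}
```

Because the client only echoes the id back, a server can change what the id encodes (for example from offsets to timestamps) without clients needing new code.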

[Analytics] EventStreams offset reset - June 14 2018

2018-06-12 Thread Andrew Otto
Alright, we are now ready to do this. New date: June 14 2018, around 14:00 UTC. On Tue, Jun 5, 2018 at 9:31 AM, Andrew Otto wrote: > Hi, > > We need to delay this switchover. We’ve made some changes to the plan > that require a bit more prep work. Follow https:/

Re: [Analytics] EventStreams offset reset - June 5 2018

2018-06-05 Thread Andrew Otto
Hi, We need to delay this switchover. We’ve made some changes to the plan that require a bit more prep work. Follow https://phabricator.wikimedia.org/T185225 for more details. I’ll reply with another announcement email when we set a new date. - Andrew Otto Senior Systems Engineer, WMF

Re: [Analytics] Pivot is now Turnilo!

2018-05-23 Thread Andrew Otto
. > > For this reason, I am thrilled that you found and implemented a > replacement that we intend to support. > > -J > > On Mon, May 21, 2018 at 11:25 AM Andrew Otto <o...@wikimedia.org> wrote: > >> Hi all! >> >> Your beloved Pivot may not be dying af

Re: [Analytics] Reinstall of stat1004 scheduled for Tuesday May 22

2018-05-22 Thread Andrew Otto
Annnd we’re done! Your home directories should be back in place. Just like stat1005, /home is actually on /srv, so there should be more space to work with in your home directories now. Thanks! On Mon, May 21, 2018 at 1:07 PM, Andrew Otto <o...@wikimedia.org> wrote: > Hi all, > >

Re: [Analytics] Reinstall of stat1004 scheduled for Tuesday May 22

2018-05-22 Thread Andrew Otto
FYI, I am beginning this now. On Mon, May 21, 2018 at 1:07 PM, Andrew Otto <o...@wikimedia.org> wrote: > Hi all, > > We are slowly reinstalling the operating systems on analytics servers to > upgrade to Debian Stretch. We’d like to do stat1004 tomorrow, Tuesday May > 22.

[Analytics] Pivot is now Turnilo!

2018-05-21 Thread Andrew Otto
be configuring a redirect from pivot.wikimedia.org to turnilo.wikimedia.org. Any bookmarked links you have should transparently redirect and work in Turnilo. If not, let us know! We will be configuring the redirect this week on Wednesday May 23. - Andrew Otto

[Analytics] Reinstall of stat1004 scheduled for Tuesday May 22

2018-05-21 Thread Andrew Otto
on May 22. Thanks! - Andrew Otto https://phabricator.wikimedia.org/T192640

[Analytics] EventStreams offset reset - June 5 2018

2018-05-15 Thread Andrew Otto
June 5 2018, at around 17:30 UTC. Let us know if you have any questions. Thanks! - Andrew Otto Senior Systems Engineer, WMF

Re: [Analytics] Content of wmf.wdqs_extract

2018-05-07 Thread Andrew Otto
CCing Stas, he might know more. On Sun, May 6, 2018 at 9:58 AM, Adrian Bielefeldt < adrian.bielefe...@mailbox.tu-dresden.de> wrote: > Hello everyone, > > I wanted to ask if anyone can tell me what wmf.wdqs_extract contains. I > know generally that it is the query log of the SPARQL endpoint.

Re: [Analytics] New SWAP (Jupyter Notebook) servers and updates!

2018-04-12 Thread Andrew Otto
FYI, I have started the decommission of notebook1001; it will no longer be accessible. On Thu, Apr 5, 2018 at 2:24 PM, Chelsy Xie <c...@wikimedia.org> wrote: > Thank you Andrew! > > On Thu, Apr 5, 2018 at 7:17 AM, Andrew Otto <o...@wikimedia.org> wrote: > >> Tilman

[Analytics] Spark2 upgraded to Spark 2.3.0, Spark 1 on the way out

2018-04-05 Thread Andrew Otto
Hi all! I just upgraded spark2 across the cluster to Spark 2.3.0. If you are using the pyspark2*, spark2-*, etc. executables, you will now be using Spark 2.3.0. We are moving towards making Spark 2 the default Spark for all Analytics

Re: [Analytics] New SWAP (Jupyter Notebook) servers and updates!

2018-04-05 Thread Andrew Otto
> say early next week, if that doesn't disrupt other things. >> >> In any case, +1 to what Leila said - I really appreciate the technical >> support for SWAP and am excited about the additional possibilities that >> this upgrade is bringing. >> >> On Mon, Apr 2

Re: [Analytics] New SWAP (Jupyter Notebook) servers and updates!

2018-04-02 Thread Andrew Otto
, Andrew Otto <o...@wikimedia.org> wrote: > Hi everyone! > > *tl;dr stop using notebook1001 by Monday April 2nd, use notebook1003 > instead.* > > *(If you don’t have production access, you can ignore this email.)* > > As part of https://phabricator.wikimedia.org/T183145

Re: [Analytics] New SWAP (Jupyter Notebook) servers and updates!

2018-03-22 Thread Andrew Otto
Oh, I forgot one thing! JupyterLab is now available too! It isn’t (yet) the default, but if you are able, try it out instead of regular old Jupyter. To do so, navigate to http://localhost:8000/user//lab On Thu, Mar 22, 2018 at 3:34 PM, Andrew Otto <o...@wikimedia.org> wrote: > Hi

[Analytics] New SWAP (Jupyter Notebook) servers and updates!

2018-03-22 Thread Andrew Otto
python notebook. I’ve updated docs at https://wikitech.wikimedia.org/wiki/SWAP#Usage, please take a look. If you have any questions, please don’t hesitate to ask, either here or on Phabricator: https://phabricator.wikimedia.org/T183145. - Andrew Otto & Analytics E

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Andrew Otto
Can we keep further discussion on the phablet thread?

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Andrew Otto
Gonna paste your reply on the ticket <https://phabricator.wikimedia.org/T184793> and respond there. On Wed, Feb 7, 2018 at 1:29 PM, Tilman Bayer <tba...@wikimedia.org> wrote: > On Wed, Feb 7, 2018 at 9:19 AM, Andrew Otto <o...@wikimedia.org> wrote: > >> It will cr

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-07 Thread Andrew Otto
t is recorded for both should have identical values for both > the preview and the pageview.) Therefore, we should go with the kind of > solution Andrew outlined above (adapting/reusing GetGeoDataUDF or such). > > On Thu, Feb 1, 2018 at 7:36 AM, Andrew Otto <o...@wikimedia.org> wrote

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-02-01 Thread Andrew Otto
e, Jan 30, 2018 at 8:02 AM, Andrew Otto <o...@wikimedia.org> wrote: > >> > Using the GeoIP cookie will require reconfiguring the EventLogging >> varnishkafka instance [0] >> >> I’m not familiar with this cookie, but, if we used it, I thought it would >>

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-30 Thread Andrew Otto
CoOOOl :) > Using the GeoIP cookie will require reconfiguring the EventLogging varnishkafka instance [0] I’m not familiar with this cookie, but, if we used it, I thought it would be sent back by the client in the event. E.g. event.country = response.headers.country; EventLogging.emit(event);

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Andrew Otto
> You could join these together in a broader ‘content consumption’ dataset somehow, either in Hadoop with batch jobs, or more realtime with streaming jobs. Hm, idea…which I think has been mentioned before: Could we leave pageviews as is, but make a new dataset that counts both pageviews and page

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-19 Thread Andrew Otto
> For virtual pageviews, people will probably be more interested in reports that belong to the first group (summing them up with normal pageviews, breaking them down along the dimensions that are relevant for web traffic, counting them for a given URL etc). Ah! Ok I get this use case now. I

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Andrew Otto
media.org> wrote: > On Thu, Jan 18, 2018 at 10:45 AM, Andrew Otto <o...@wikimedia.org> wrote: > >> > the beacon puts the record into the webrequest table and from there it >> would only take some trivial preprocessing >> ‘Trivial’ preprocessing that has to look throu

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Andrew Otto
> For example, UI instrumentations on the web are almost always sampled, because that yields enough data to answer UI questions - but on the other hand tend to record much more detail about the individual interaction. In contrast, we register all pageviews unsampled, but don't keep a permanent

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Andrew Otto
th the same depth as pageviews. > > Thanks! > > - Olga > > On Thu, Jan 18, 2018 at 12:46 PM Andrew Otto <o...@wikimedia.org> wrote: > >> > the beacon puts the record into the webrequest table and from there it >> would only take some trivial preprocessi

Re: [Analytics] [Ops] How best to accurately record page interactions in Page Previews

2018-01-18 Thread Andrew Otto
> the beacon puts the record into the webrequest table and from there it would only take some trivial preprocessing ‘Trivial’ preprocessing that has to look through 150K requests per second! This is a lot of work! > tracking of events is better done on an event based system and EL is such a

[Analytics] EventLogging data now available in Hive

2017-12-04 Thread Andrew Otto
Hi everyone, We are now supporting Hive tables with EventLogging data! This has been a long project. We finally feel comfortable enough to announce support for this method of querying EventLogging data. You can read documentation on how to access this

[Analytics] Spark 2 now available in Hadoop

2017-11-13 Thread Andrew Otto
<https://github.com/wikimedia/analytics-refinery-source> still uses Spark 1, but we’d also like to update jobs and dependencies there to use Spark 2 soon. Anyway, let me know if there are any questions. Enjoy! - Andrew Otto Systems Engineer, WMF

Re: [Analytics] stat1002 and stat1003 deprecated. Please use new stat boxes

2017-09-05 Thread Andrew Otto
Alright, I just ran a final rsync of /home on stat1003 -> stat1006. Only files that did not yet exist on stat1006, or that had a newer modification time on stat1003, were copied over. We will begin the decommission of stat1003 tomorrow. On Mon, Aug 28, 2017 at 12:02 PM, Andrew Otto
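The copy rule described here (copy a file only if it is missing on the destination or newer on the source) can be sketched in a few lines. This is an illustration of the rule only, not the rsync invocation that was actually run, which the message does not show; the function name `sync_newer` is hypothetical.

```python
import os
import shutil
from typing import List


def sync_newer(src_dir: str, dst_dir: str) -> List[str]:
    """Copy files that are missing in dst_dir or newer in src_dir.

    Returns the sorted relative paths of the files that were copied.
    """
    copied = []
    for root, _dirs, files in os.walk(src_dir):
        rel_root = os.path.relpath(root, src_dir)
        for name in files:
            src = os.path.join(root, name)
            dst = os.path.normpath(os.path.join(dst_dir, rel_root, name))
            missing = not os.path.exists(dst)
            if missing or os.path.getmtime(src) > os.path.getmtime(dst):
                os.makedirs(os.path.dirname(dst), exist_ok=True)
                shutil.copy2(src, dst)  # copy2 preserves the modification time
                copied.append(os.path.relpath(src, src_dir))
    return sorted(copied)
```

Files that were edited on the destination after the previous sync keep their newer contents, which is the point of comparing modification times instead of overwriting everything.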

[Analytics] EventStreams Outage

2017-08-29 Thread Andrew Otto
for us (me) to notice, so I’ve created https://phabricator.wikimedia.org/T174493 to help us catch something like this in the future. Apologies if this caused any inconvenience. -Andrew Otto Systems Engineer, Wikimedia Foundation

Re: [Analytics] stat1002 and stat1003 deprecated. Please use new stat boxes

2017-08-28 Thread Andrew Otto
Hi all! Just an update: We plan to decommission stat1003 next week. I’ll be sure to run a final home directory rsync from stat1003 -> stat1006 before we do. On Tue, Jul 18, 2017 at 1:31 PM, Andrew Otto <o...@wikimedia.org> wrote: > Hi all! > > tl;dr: Stop using stat100[23

Re: [Analytics] Resources stat1005

2017-08-12 Thread Andrew Otto
Hadoop anyway, so you might as well continue to do processing there, no? If you do, you shouldn’t have to worry (too much) about resource contention: https://yarn.wikimedia.org/cluster/scheduler :) - Andrew Otto Systems Engineer, WMF On Sat, Aug 12, 2017 at 2:20 PM, Erik Zachte

Re: [Analytics] stat1002 and stat1003 deprecated. Please use new stat boxes

2017-07-27 Thread Andrew Otto
This is done! stat1002 is offline. Thanks all! On Thu, Jul 27, 2017 at 5:27 PM, Andrew Otto <o...@wikimedia.org> wrote: > > /a and /home on stat1002 (as they are now) will no longer be accessible > early next week. > > Welp, it turns out we need to accelerate this t

Re: [Analytics] stat1002 and stat1003 deprecated. Please use new stat boxes

2017-07-27 Thread Andrew Otto
Thu, Jul 27, 2017 at 1:57 PM, Andrew Otto <o...@wikimedia.org> wrote: > > Please be fully migrated to the new nodes by September 1st. > > Some other issues are accelerating this timeline for stat1002. > > All /home directories from stat1002 have been synced to stat1005. In >

Re: [Analytics] stat1002 and stat1003 deprecated. Please use new stat boxes

2017-07-27 Thread Andrew Otto
ces anyone. - Andrew Otto Systems Engineer Wikimedia Foundation On Tue, Jul 18, 2017 at 1:31 PM, Andrew Otto <o...@wikimedia.org> wrote: > Hi all! > > tl;dr: Stop using stat100[23] by September 1st. > > We’re finally replacing stat1002 and stat1003. These box

[Analytics] stat1002 and stat1003 deprecated. Please use new stat boxes

2017-07-18 Thread Andrew Otto
you of the impending deadline of Sept 1. -Andrew Otto

Re: [Analytics] EventStreams launch and RCStream deprecation

2017-07-10 Thread Andrew Otto
Alright, we’ve done it! RCStream is disabled, so any remaining socket.io service connecting to stream.wikimedia.org/rc will fail. Thanks all! On Thu, Jun 22, 2017 at 1:00 PM, Andrew Otto <o...@wikimedia.org> wrote: > Hi all, > > This is just a friendly reminder that we p

Re: [Analytics] EventStreams launch and RCStream deprecation

2017-06-22 Thread Andrew Otto
running on RCStream that hasn’t yet ported, let us know, and/or switch soon! Thanks! -Andrew Otto On Wed, Feb 8, 2017 at 9:28 AM, Andrew Otto <o...@wikimedia.org> wrote: > Hi everyone! > > Wikimedia is releasing a new service today: EventStreams > <https://wikite

Re: [Analytics] [Research-Internal] Fwd: db1047 (s1, s2 analytlics-slave) downtime Wednesday May 3

2017-05-03 Thread Andrew Otto
- Forwarded message -- > From: Andrew Otto <o...@wikimedia.org> > Date: Fri, Apr 28, 2017 at 12:12 PM > Subject: [Analytics] db1047 (s1, s2 analytlics-slave) downtime Wednesday > May 3 > To: "A mailing list for the Analytics Team at WMF and everybody who has

[Analytics] db1047 (s1, s2 analytlics-slave) downtime Wednesday May 3

2017-04-28 Thread Andrew Otto
Hi all! The server that hosts s1, s2 analytics-slaves (db1047) needs a little love. It’ll be offline for a little while on Wed May 3, starting at 10am EST. https://phabricator.wikimedia.org/T159266 -Ao

Re: [Analytics] Short Hive, Oozie, Druid & Pivot downtime Tuesday April 25th

2017-04-24 Thread Andrew Otto
ow if this causes any trouble to you. >> >> Thanks and sorry for the late notification! >> >> Luca >> >> 2017-04-13 18:36 GMT+02:00 Andrew Otto <o...@wikimedia.org>: >> >>> Update: >>> >>> Due to the big Datacenter Switchover happeni

[Analytics] Short Hive, Oozie, Druid & Pivot downtime Tuesday April 25th

2017-04-13 Thread Andrew Otto
Update: Due to the big Datacenter Switchover happening next week, we’ve decided to postpone this a bit. We won’t be doing this downtime on Monday April 17th. Instead, we will do this at 13:30 UTC on Tuesday April 25th. Thanks all! On Thu, Apr 6, 2017 at 1:04 PM, Andrew Otto

[Analytics] Short Hive, Oozie, Druid & Pivot downtime Monday April 17th

2017-04-06 Thread Andrew Otto
Hi all! As part of our Hadoop Cluster Debian Jessie upgrade, we need to reinstall the server that acts as a metadata state store for Hive, Oozie and Druid. To be safe, we plan to take those services plus Pivot (which runs on Druid) offline while we reinstall this server. We plan to do

Re: [Analytics] Data Lake documentation on Wikitech

2017-03-25 Thread Andrew Otto
Ja, +1 to what Dan said. Data Lake is our term for the ability to serve queries on large refined datasets to users. Analytics Cluster refers to (almost) all of our infrastructure. Hadoop will likely always power part of Data Lake, but in an ideal world, we’d have some other lower latency query

Re: [Analytics] Analytics Cluster Hadoop Upgrade and Downtime: February 28 2017

2017-02-28 Thread Andrew Otto
And, we’re back up! Thanks all! Everything went smoothly. (Big thanks to Luca for driving.) -Andrew & Luca On Tue, Feb 28, 2017 at 9:07 AM, Andrew Otto <o...@wikimedia.org> wrote: > Alright! This is starting now. > > On Mon, Feb 27, 2017 at 3:29 PM, Andrew Otto

Re: [Analytics] Analytics Cluster Hadoop Upgrade and Downtime: February 28 2017

2017-02-28 Thread Andrew Otto
Alright! This is starting now. On Mon, Feb 27, 2017 at 3:29 PM, Andrew Otto <o...@wikimedia.org> wrote: > Just a reminder, we will be taking the Analytics Hadoop Cluster offline > tomorrow morning EST. I’ll email again tomorrow right before we do, and > also onc

Re: [Analytics] Missing mediacounts for 2016-12-01

2017-02-16 Thread Andrew Otto
Ah ha! Yeah, a certain hour didn’t run in time, so the archive/copy job timed out. Sorry about that. I’ve rerun and the files exist now. Thanks for letting us know! On Thu, Feb 16, 2017 at 11:27 AM, Luca Toscano wrote: > > > 2017-02-16 15:18 GMT+01:00 Federico Leva

[Analytics] Analytics Cluster Hadoop Upgrade and Downtime: February 28 2017

2017-02-14 Thread Andrew Otto
Hi everyone, We are planning an upgrade of the Hadoop cluster on February 28th. We need to take the cluster down for this upgrade. The actual upgrade shouldn’t take more than 2 hours, but we’re going to reserve the whole work day of February 28th to do this, just in case something goes wrong.

[Analytics] EventStreams launch and RCStream deprecation

2017-02-08 Thread Andrew Otto
ntly available events are described here <https://github.com/wikimedia/mediawiki-event-schemas/tree/master/jsonschema/mediawiki>.) Thanks! - Andrew Otto

Re: [Analytics] Fwd: Removing wfIncrStats from MobileFrontend's Special:MobileOptions page

2017-01-25 Thread Andrew Otto
I’m not totally sure. statsd doesn’t actually store anything; these metrics are saved by graphite. We may have to manually purge them. I’d open a Phab ticket, CC me, and tag Operations. On Wed, Jan 25, 2017 at 3:34 PM, Sam Smith wrote: > Hullo, > > How do we go about

Re: [Analytics] EventLogging RL modules can take a very long time to generate

2017-01-23 Thread Andrew Otto
Wow, hm, I don’t know anything about ResourceLoader, but I am happy to learn! Will respond on ticket.

[Analytics] stats.grok.se used in study about Snowden and internet traffic

2017-01-18 Thread Andrew Otto
Saw this on reddit: https://theintercept.com/2016/04/28/new-study-shows-mass-surveillance-breeds-meekness-fear-and-self-censorship/ From the paper

Re: [Analytics] stats.wikimedia.org (and other sites) get a new server

2016-12-13 Thread Andrew Otto
This has been done! Let us know if you notice any problems. On Mon, Dec 12, 2016 at 1:51 PM, Andrew Otto <o...@wikimedia.org> wrote: > Hi all! > > We are replacing the server that hosts stats.wikimedia.org, > analytics.wikimedia.org, datasets.wikimedia.org, and other various

[Analytics] stats.wikimedia.org (and other sites) get a new server

2016-12-12 Thread Andrew Otto
Hi all! We are replacing the server that hosts stats.wikimedia.org, analytics.wikimedia.org, datasets.wikimedia.org, and other various sites this week. The new server is ready to go, so I’d like to do this tomorrow December 13th around 15:00 UTC. There should be no noticeable downtime for these

Re: [Analytics] eventlogging mysql/analytics stores maintenance

2016-12-07 Thread Andrew Otto
We’ll need to stop the eventlogging-consumer-mysql process on eventlog1001 while the master restart happens. I’m not working this Friday, but anyone on the Analytics team can do this. On Wed, Dec 7, 2016 at 1:27 PM, Jaime Crespo wrote: > Hi, > > I would like to do a

Re: [Analytics] Statsv

2016-11-14 Thread Andrew Otto
+ops Analytics (Otto & Luca) probably have the most experience with python kafka clients, and also are the most likely to cause statsv problems (due to analytics kafka broker restarts, etc.). So it makes sense for us to be at least partially responsible. On the other hand, statsv is for

Re: [Analytics] Parsing user agents in EventLogging data

2016-09-15 Thread Andrew Otto
I’ve added an example to https://wikitech.wikimedia.org/wiki/Analytics/EventLogging#Hive on how to use the UAParserUDF and the Hive get_json_object function to work with a user_agent_map. Unfortunately we can’t manage tables in Hive for every EventLogging schema/revision like we do in MySQL. So,

Re: [Analytics] Wiki pagecounts-raw is missing since 08/05

2016-08-15 Thread Andrew Otto
See this announcement: https://lists.wikimedia.org/pipermail/analytics/2016-August/005339.html On Tue, Aug 9, 2016 at 4:15 PM, Kushal Tayal wrote: > Hi, > > I noticed that there are no wikipage counts that have been published since > 08/05 -

Re: [Analytics] pagecounts-all-sites

2016-08-15 Thread Andrew Otto
See this announcement: https://lists.wikimedia.org/pipermail/analytics/2016-August/005339.html On Fri, Aug 12, 2016 at 6:59 PM, Dylan Wenzlau wrote: > Hello analytics team! > > It seems that your amazingly useful pageview dumps >

Re: [Analytics] New EventLogging schemas don’t work after Kafka 0.9 upgrade

2016-05-13 Thread Andrew Otto
to auto create topics just fine. Phew! This didn’t actually affect production, and is fixed in beta now too. -Andrew [1] Fix here: https://gerrit.wikimedia.org/r/#/c/288604/ On Thu, May 12, 2016 at 5:33 PM, Andrew Otto <aco...@gmail.com> wrote: > Hi all! > > We just no

[Analytics] New EventLogging schemas don’t work after Kafka 0.9 upgrade

2016-05-12 Thread Andrew Otto
Hi all! We just noticed a problem with the (old) version of the kafka-python client we are using to produce EventLogging events to Kafka: it doesn’t handle creation of new topics now that we’ve upgraded the Kafka cluster to 0.9. This means that until we fix this, events produced to new schemas will not be

Re: [Analytics] Hive & Oozie downtime tomorrow

2016-04-20 Thread Andrew Otto
Ok! Hive and Oozie are back up and running on the new box. We are fixing a few production Oozie job issues, but for regular users everything is back to normal. Proceed! Thanks for your patience! On Wed, Apr 20, 2016 at 10:32 AM, Andrew Otto <o...@wikimedia.org> wrote: > Hi eve

[Analytics] Hive & Oozie downtime tomorrow

2016-04-19 Thread Andrew Otto
Hi all! As part of https://phabricator.wikimedia.org/T130840, we need to schedule a short downtime for Hive and Oozie. I would like to proceed with this tomorrow if there are no objections. I’d like to schedule this downtime for an hour starting at 14:45 UTC (10:45 EST, 07:45 PST) Wednesday

[Analytics] webrequest misc usage?

2016-02-29 Thread Andrew Otto
Hi all, Ops is working on upgrading our web caching software from Varnish 3 to Varnish 4. At the moment, this would mean we would lose webrequest logs, since our webrequest logging software (varnishkafka) is incompatible with Varnish 4. There is an effort to fix this incompatibility, but Ops

[Analytics] Analytics Cluster maintenance for CDH 5.5 upgrade

2016-02-18 Thread Andrew Otto
Hiya, We’re ready to upgrade the Analytics Cluster to CDH 5.5. To do so, we need to schedule a maintenance period during which we can stop all Hadoop related services. This includes Hive, Oozie, Spark, etc. I’d like to plan this for Tuesday February 23rd starting at 14:00 UTC (09:00 US east

Re: [Analytics] Issues on Cluster

2016-02-01 Thread Andrew Otto
Woohoo! Thanks so much Joseph!!! :) On Sun, Jan 31, 2016 at 4:49 AM, Joseph Allemandou < jalleman...@wikimedia.org> wrote: > Hi All, > Everything is back to normal on the cluster, no data loss has been > incurred and jobs are up-to-date. > You can get back to your normal utilisation ! > Thanks
