Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-25 Thread Nuria Ruiz
Hello:

Following up on this issue: we think many of Neil's issues come from the
fact that a Kerberos ticket expires after 24 hours, and once it does, your
Spark session no longer works. We will be extending ticket expiration
somewhat, to 2-3 days, but the main point to take home is that Jupyter
notebooks do not live forever in the state you leave them in; a restart of
the kernel might be needed.
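The expiry logic above can be sketched in a few lines (illustrative Python only, not wmfdata or Analytics tooling; on the hosts themselves the real check is `klist`, and 24 hours is the current lifetime mentioned above):

```python
from datetime import datetime, timedelta

# Hypothetical helper: decide whether a Kerberos ticket issued with the
# current 24 h lifetime has expired, i.e. whether the Spark session behind
# a long-lived notebook is likely dead and the kernel needs a restart.
TICKET_LIFETIME = timedelta(hours=24)

def ticket_expired(issued_at, now, lifetime=TICKET_LIFETIME):
    """True once `now` is at or past the ticket's expiry time."""
    return now - issued_at >= lifetime

issued = datetime(2020, 2, 24, 9, 0)
print(ticket_expired(issued, datetime(2020, 2, 25, 10, 0)))  # True: ticket is >24 h old
print(ticket_expired(issued, datetime(2020, 2, 24, 18, 0)))  # False: still valid
```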

Please take a look at ticket:
https://phabricator.wikimedia.org/T246132

If anybody has been having similar problems, please chime in.

Thanks,

Nuria

On Thu, Feb 20, 2020 at 2:27 AM Luca Toscano  wrote:

> Hi Neil,
>
> I added the Analytics tag to https://phabricator.wikimedia.org/T245097,
> and also thanks for filing https://phabricator.wikimedia.org/T245713. We
> periodically review tasks in our incoming queue, so we should be able to
> help soon, but it may depend on priorities.
>
> Luca
>
> On Thu, 20 Feb 2020 at 06:21, Neil Shah-Quinn <
> nshahqu...@wikimedia.org> wrote:
>
>> Another update: I'm continuing to encounter these Spark errors and have
>> trouble recovering from them, even when I use proper settings. I've filed
>> T245713 to discuss this
>> further. The specific errors and behavior I'm seeing (for example, whether
>> explicitly calling session.stop allows a new functioning session to be
>> created) are not consistent, so I'm still trying to make sense of it.
>>
>> I would greatly appreciate any input or help, even if it's identifying
>> places where my description doesn't make sense.
>> 
>> 
>>
>> On Wed, 19 Feb 2020 at 13:35, Neil Shah-Quinn 
>> wrote:
>>
>>> Bump!
>>>
>>> Analytics team, I'm eager to have input from y'all about the best Spark
>>> settings to use.
>>>
>>> On Fri, 14 Feb 2020 at 18:30, Neil Shah-Quinn 
>>> wrote:
>>>
 I ran into this problem again, and I found that neither session.stop() nor
 newSession() got rid of the error. So it's still not clear how to recover
 from a crashed(?) Spark session.

 On the other hand, I did figure out why my sessions were crashing in
 the first place, so hopefully recovering from that will be a rare need. The
 reason is that wmfdata doesn't modify the default Spark settings when it
 starts a new session, which was (for example) causing it to start executors
 with only ~400 MiB of memory each.
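Overriding those defaults looks roughly like the sketch below. The keys are real Spark config names, but the values are purely illustrative; the right numbers for the Analytics cluster are exactly what T245097 is asking, not settled here:

```python
# Illustrative Spark settings to pass when building a session, instead of
# relying on defaults (which gave executors only ~400 MiB, as noted above).
# Values are examples, not the Analytics team's recommendation.
spark_settings = {
    "spark.driver.memory": "2g",
    "spark.executor.memory": "4g",
    "spark.executor.cores": "2",
    "spark.sql.shuffle.partitions": "256",
}

# With PySpark these would be applied roughly like this (not executed here):
#   builder = SparkSession.builder
#   for key, value in spark_settings.items():
#       builder = builder.config(key, value)
#   spark = builder.getOrCreate()
print(spark_settings["spark.executor.memory"])  # 4g
```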

 I'm definitely going to change that, but it's not completely clear what
 the recommended settings for our cluster are. I cataloged the different
 recommendations at https://phabricator.wikimedia.org/T245097, and it
 would be very helpful if one of y'all could give some clear recommendations
 about what the settings should be for local SWAP, YARN, and "large" YARN
 jobs. For example, is it important to increase spark.sql.shuffle.partitions
 for YARN jobs? Is it reasonable to use 8 GiB of driver memory for a local
 job when the SWAP servers only have 64 GiB total?
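The 8 GiB question is easy to sanity-check with back-of-the-envelope arithmetic (my numbers, not a recommendation):

```python
# If each local Spark driver asks for 8 GiB on a SWAP server with 64 GiB
# total, only a handful of users can run at once -- and this ignores the OS
# and everyone's notebook kernels, which makes the real ceiling lower.
server_ram_gib = 64
driver_gib = 8
max_drivers = server_ram_gib // driver_gib
print(max_drivers)  # 8
```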

 Thank you!




 On Fri, 7 Feb 2020 at 06:53, Andrew Otto  wrote:

> Hm, interesting!  I don't think many of us have used
> SparkSession.builder.getOrCreate repeatedly in the same process.
> What happens if you manually stop the Spark session first
> (session.stop()?), or maybe try to explicitly create a new session via
> newSession()?
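Why a crashed session keeps coming back hinges on how getOrCreate caches the active session. A toy model (plain Python, not the real PySpark API) of that caching behavior:

```python
# Toy model of the getOrCreate pattern, to show why a broken session keeps
# being returned until it is explicitly stopped: getOrCreate hands back the
# cached instance, healthy or not, until the cache is cleared.
class Session:
    _active = None

    def __init__(self):
        self.stopped = False

    @classmethod
    def get_or_create(cls):
        if cls._active is None:
            cls._active = cls()
        return cls._active

    def stop(self):
        self.stopped = True
        Session._active = None  # real Spark also clears the active session on stop()

s1 = Session.get_or_create()
s1.stop()
s2 = Session.get_or_create()
print(s2 is s1)  # False: stopping cleared the cache, so a fresh session is built
print(Session.get_or_create() is s2)  # True: without stop(), you get the same object back
```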
>
> On Thu, Feb 6, 2020 at 7:31 PM Neil Shah-Quinn <
> nshahqu...@wikimedia.org> wrote:
>
>> Hi Luca!
>>
>> Those were separate Yarn jobs I started later. When I got this error,
>> I found that the Yarn job corresponding to the SparkContext was marked as
>> "successful", but I still couldn't get SparkSession.builder.getOrCreate to
>> open a new one.
>>
>> Any idea what might have caused that or how I could recover without
>> restarting the notebook, which could mean losing a lot of in-progress work?
>> I had already restarted that kernel, so I don't know if I'll encounter this
>> problem again. If I do, I'll file a task.
>>
>> On Wed, 5 Feb 2020 at 23:24, Luca Toscano 
>> wrote:
>>
>>> Hey Neil,
>>>
>>> There were two Yarn jobs running related to your notebooks; I just killed
>>> them. Let's see if that solves the problem (you might need to restart your
>>> notebook again). If not, let's open a task and investigate :)
>>>
>>> Luca
>>>
>>> On Thu, 6 Feb 2020 at 02:08, Neil Shah-Quinn <
>>> nshahqu...@wikimedia.org> wrote:

Re: [Analytics] Community health metrics kit: Input needed!

2020-02-25 Thread Dan Andreescu
In my last meeting with Joe, they were still collecting requirements, but
that was before Joe got shifted around the org, and I'm guessing the project
is now done.

This effort is exactly what I'm talking about in my email to product-all
and tech-all yesterday (subject: The Question and Answer Flow).  Here we
have a set of questions about community health that we've stopped asking
because it hasn't been possible to get answers.  I suggest folks who care
about this chime in there and point to how these metrics would help them.

On Tue, Feb 25, 2020 at 12:15 PM Ryan Kaldari 
wrote:

> Anyone else know anything about the fate of the community health metrics
> kit? The wiki page
> 
> still says it's expected to be launched in 2019.
>
> On Thu, Feb 20, 2020 at 1:19 PM Ryan Kaldari 
> wrote:
>
>> Hey Joe,
>> whatever happened with this? Is it still being worked on?
>>
>> On Fri, Oct 5, 2018 at 3:29 PM Joe Sutherland 
>> wrote:
>>
>>> Hello everyone - apologies for cross-posting! *TL;DR*: We would like
>>> your feedback on our Metrics Kit project. Please have a look and comment on
>>> Meta-Wiki:
>>> https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
>>>
>>>
>>> The Wikimedia Foundation's Trust and Safety team, in collaboration with
>>> the Community Health Initiative, is working on a Metrics Kit designed to
>>> measure the relative "health"[1] of various communities that make up the
>>> Wikimedia movement:
>>> https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
>>>
>>> The ultimate outcome will be a public suite of statistics and data
>>> looking at various aspects of Wikimedia project communities. This could be
>>> used by both community members to make decisions on their community
>>> direction and Wikimedia Foundation staff to point anti-harassment tool
>>> development in the right direction.
>>>
>>> We have a set of metrics we are thinking about including in the kit,
>>> ranging from the ratio of active users to active administrators,
>>> administrator confidence levels, and off-wiki factors such as freedom to
>>> participate. It's ambitious, and our methods of collecting such data will
>>> vary.
>>>
>>> Right now, we'd like to know:
>>> * Which metrics make sense to collect? Which don't? What are we missing?
>>> * Where would such a tool ideally be hosted? Where would you normally
>>> look for statistics like these?
>>> * We are aware of the overlap in scope between this and Wikistats <
>>> https://stats.wikimedia.org/v2/#/all-projects> — how might these tools
>>> coexist?
>>>
>>> Your opinions will help to guide this project going forward. We'll be
>>> reaching out at different stages of this project, so if you're interested
>>> in direct messaging going forward, please feel free to indicate your
>>> interest by signing up on the consultation page.
>>>
>>> Looking forward to reading your thoughts.
>>>
>>> best,
>>> Joe
>>>
>>> P.S.: Please feel free to CC me in conversations that might happen on
>>> this list!
>>>
>>> [1] What do we mean by "health"? There is no standard definition of what
>>> makes a Wikimedia community "healthy", but there are many indicators that
>>> highlight where a wiki is doing well, and where it could improve. This
>>> project aims to provide a variety of useful data points that will inform
>>> community decisions that will benefit from objective data.
>>>
>>> --
>>> *Joe Sutherland* (he/him or they/them)
>>> Trust and Safety Specialist
>>> Wikimedia Foundation
>>> joesutherland.rocks
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>


Re: [Analytics] Community health metrics kit: Input needed!

2020-02-25 Thread Ryan Kaldari
Anyone else know anything about the fate of the community health metrics
kit? The wiki page

still says it's expected to be launched in 2019.

On Thu, Feb 20, 2020 at 1:19 PM Ryan Kaldari  wrote:

> Hey Joe,
> whatever happened with this? Is it still being worked on?
>
> On Fri, Oct 5, 2018 at 3:29 PM Joe Sutherland 
> wrote:
>
>> Hello everyone - apologies for cross-posting! *TL;DR*: We would like
>> your feedback on our Metrics Kit project. Please have a look and comment on
>> Meta-Wiki:
>> https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
>>
>>
>> The Wikimedia Foundation's Trust and Safety team, in collaboration with
>> the Community Health Initiative, is working on a Metrics Kit designed to
>> measure the relative "health"[1] of various communities that make up the
>> Wikimedia movement:
>> https://meta.wikimedia.org/wiki/Community_health_initiative/Metrics_kit
>>
>> The ultimate outcome will be a public suite of statistics and data
>> looking at various aspects of Wikimedia project communities. This could be
>> used by both community members to make decisions on their community
>> direction and Wikimedia Foundation staff to point anti-harassment tool
>> development in the right direction.
>>
>> We have a set of metrics we are thinking about including in the kit,
>> ranging from the ratio of active users to active administrators,
>> administrator confidence levels, and off-wiki factors such as freedom to
>> participate. It's ambitious, and our methods of collecting such data will
>> vary.
>>
>> Right now, we'd like to know:
>> * Which metrics make sense to collect? Which don't? What are we missing?
>> * Where would such a tool ideally be hosted? Where would you normally
>> look for statistics like these?
>> * We are aware of the overlap in scope between this and Wikistats <
>> https://stats.wikimedia.org/v2/#/all-projects> — how might these tools
>> coexist?
>>
>> Your opinions will help to guide this project going forward. We'll be
>> reaching out at different stages of this project, so if you're interested
>> in direct messaging going forward, please feel free to indicate your
>> interest by signing up on the consultation page.
>>
>> Looking forward to reading your thoughts.
>>
>> best,
>> Joe
>>
>> P.S.: Please feel free to CC me in conversations that might happen on
>> this list!
>>
>> [1] What do we mean by "health"? There is no standard definition of what
>> makes a Wikimedia community "healthy", but there are many indicators that
>> highlight where a wiki is doing well, and where it could improve. This
>> project aims to provide a variety of useful data points that will inform
>> community decisions that will benefit from objective data.
>>
>> --
>> *Joe Sutherland* (he/him or they/them)
>> Trust and Safety Specialist
>> Wikimedia Foundation
>> joesutherland.rocks
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Research-Internal] Tutorials on disk space usage for notebook/stat boxes

2020-02-25 Thread Goran Milovanovic
Great job Luca. Thank you very much.

I have started to diversify all WMDE Analytics jobs (mainly Wikidata
related things) across the stat100* machines.
While I still mainly use stat1007, two modules of the WDCM system have
already been migrated to stat1004.

Best,
Goran

Goran S. Milovanović, PhD
Data Scientist, Software Department
Wikimedia Deutschland


"It's not the size of the dog in the fight,
it's the size of the fight in the dog."
- Mark Twain



On Wed, Feb 19, 2020 at 4:33 AM Neil Shah-Quinn 
wrote:

> Thank you very much, Luca!
>
> To make this nice documentation easier to discover, I moved it to
> Analytics/Systems/Clients, along with
> the other information on the clients from Analytics/Data access.
>
> On Tue, 18 Feb 2020 at 17:11, Isaac Johnson  wrote:
>
>> Thanks for pulling together these directions Luca! I did a little
>> clean-up and will try to remember to do so more routinely.
>>
>> Adding to what Diego said, I also started using stat1007 because it has
>> the most access to resources (dumps, Hadoop, MariaDB), and then my virtual
>> environments, config files, etc. are there and so I tend to do all of my
>> work on stat1007 even when the other stat machines might work for other
>> projects. Putting the GPU on stat1005 helped me diversify a little but I'm
>> very excited to hear that the stat machines will be more standardized so it
>> matters less which machine I choose. While I have no desire to be spread
>> out across the machines (a few projects on stat1004, a few on stat1005,
>> etc.) because then I'll certainly lose track of where different projects
>> are, I would be open to trying to choose another host as my "main"
>> workspace.
>>
>> Best,
>> Isaac
>>
>> On Tue, Feb 18, 2020 at 10:53 AM Andrew Otto  wrote:
>>
>>> I added a 'GPU?' column too. :)  THANKS LUCA!
>>>
>>> On Tue, Feb 18, 2020 at 11:51 AM Luca Toscano 
>>> wrote:
>>>
 Hey Diego,

 I added a section at the end of the page with the requested info; let me
 know if anything is missing :)

 Luca

 On Tue, 18 Feb 2020 at 17:37, Diego Saez-Trumper <
 di...@wikimedia.org> wrote:

> Thanks for this Luca.
>
> I tend to use stat1007 because I know that machine has a lot of
> RAM/CPU and HDFS access. For the other stat100X machines, I'm not sure
> which have what resources (I know at least one of them doesn't have HDFS
> access). Is there a table where I can look at a summary of resources per
> machine?
>
> Thanks again.
>
> On Tue, Feb 18, 2020 at 8:53 AM Luca Toscano 
> wrote:
>
>> Hi everybody!
>>
>> I created the following doc:
>> https://wikitech.wikimedia.org/wiki/Analytics/Tutorials/Analytics_Client_Nodes
>>
>> It contains two FAQs:
>> - How do I ensure that there is enough space on disk before storing big
>> datasets/files?
>> - How do I check the space used by my files/data on stat/notebook hosts?
>>
>> Please read them and let me know if anything is unclear or missing. We
>> have plenty of space on the stat100X hosts, but we tend to cluster on
>> single machines like stat1007 for some reason, ending up fighting for
>> resources.
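The first FAQ's check can be scripted; a minimal sketch using only the standard library (illustrative, not copied from the tutorial -- on a stat host you would point it at your home directory or /srv rather than /tmp):

```python
import shutil

# Check free space on the target filesystem before writing a big dataset.
usage = shutil.disk_usage("/tmp")
free_gib = usage.free / 1024**3
print(f"{free_gib:.1f} GiB free")  # a real check would compare this to the dataset size

# For the second FAQ, `du -sh <dir>` on the host reports what your files use.
```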
>>
>> On a related note, we are going to work on unifying stat/notebook
>> puppet configs in https://phabricator.wikimedia.org/T243934, so
>> eventually all Analytics clients will be exactly the same.
>>
>> Thanks!
>>
>> Luca (on behalf of the Analytics team)
>>
>>
>> ___
>> Research-Internal mailing list
>> research-inter...@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/research-internal
>>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
>>
>>
>> --
>> Isaac Johnson (he/him/his) -- Research Scientist -- Wikimedia Foundation
>>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics