[Analytics] Re: Mediacounts fields

2022-11-04 Thread Neil Shah-Quinn
I believe Connie Chen and Isaac Johnson did some work on distinguishing
"real images" from icons as part of the image suggestion analytics (T292316
<https://phabricator.wikimedia.org/T292316>). I don't know the details, but
perhaps one of them could chime in.
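In the meantime, if you want to experiment with a simple size-based heuristic
yourself, here is a rough sketch using wmfdata. It assumes the Data Lake table
is wmf.mediarequest with the year/month/day partitions visible in Dan's sample
below, and the 50 kB average-size cutoff is arbitrary:

import wmfdata

# Keep only files whose average response size suggests a "real image" rather
# than an icon. The table name, partitions, and 50 kB cutoff are assumptions.
query = """
SELECT
  base_name,
  SUM(request_count) AS requests,
  SUM(total_bytes) / SUM(request_count) AS avg_bytes_per_request
FROM wmf.mediarequest
WHERE year = 2022 AND month = 9 AND day = 9
  AND media_classification = 'image'
  AND agent_type = 'user'
GROUP BY base_name
HAVING SUM(total_bytes) / SUM(request_count) > 50 * 1024
ORDER BY requests DESC
LIMIT 1000
"""
top_non_icons = wmfdata.hive.run(query)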

--
Neil Shah-Quinn
senior data scientist, Product Analytics
<https://www.mediawiki.org/wiki/Product_Analytics>
Wikimedia Foundation <https://wikimediafoundation.org/>


On Fri, 4 Nov 2022 at 11:17, Dan Andreescu  wrote:

> hm, you know, maybe it's not such a great idea to show all these small
> files in the mediarequests/top endpoint.  I imagine everyone trying to use
> it would have the same problems you are.  Maybe we can brainstorm together
> on a way to filter out results you might not want.  If that top 1000 list
> included only images you found interesting, would that solve your problem?
> If so, let's brainstorm.
>
> So the schema of the data we have available is this
> <https://github.com/wikimedia/analytics-refinery/blob/master/hql/mediarequest/create_mediarequest_table.hql>
> .
>
> base_name            string COMMENT 'Base name of media file',
> media_classification string COMMENT 'General classification of media (image, video, audio, data, document or other)',
> file_type            string COMMENT 'Extension or suffix of the file (e.g. jpg, wav, pdf)',
> total_bytes          bigint COMMENT 'Total number of bytes',
> request_count        bigint COMMENT 'Total number of requests',
> transcoding          string COMMENT 'Transcoding that the file was requested with, e.g. resized photo or image preview of a video',
> agent_type           string COMMENT 'Agent accessing the media files, can be spider or user',
> referer              string COMMENT 'Wiki project that the request was refered from. If project is not available, it will be either internal, external, or unknown',
> dt                   string COMMENT 'UTC timestamp in ISO 8601 format (e.g. 2019-08-27T14:00:00Z)'
>
> And here's some sample data (request count > 5 for privacy).
>
>
> "/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg","image","svg","486642310","119697","image_0_199","user","en.wikipedia","2022-09-09T06:00:00Z","2022","9","9","6"
>
> "/wikipedia/commons/d/d4/Button_hide.png","image","png","26477640","93145","original","user","en.wikipedia","2022-09-09T23:00:00Z","2022","9","9","23"
>
> "/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg","image","svg","300264742","73620","image_0_199","user","en.wikipedia","2022-09-09T05:00:00Z","2022","9","9","5"
>
> "/wikipedia/commons/2/23/Icons-mini-file_acrobat.gif","image","gif","27279795","93779","original","user","ja.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
>
> "/wikipedia/en/a/ae/Flag_of_the_United_Kingdom.svg","image","svg","86260254","130257","image_0_199","user","en.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
>
> "/wikipedia/commons/f/fa/Wikiquote-logo.svg","image","svg","254832231","83127","image_0_199","user","en.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
>
> "/wikipedia/en/a/a4/Flag_of_the_United_States.svg","image","svg","76327061","90739","image_0_199","user","en.wikipedia","2022-09-09T03:00:00Z","2022","9","9","3"
>
> "/wikipedia/commons/b/b6/Queen_Elizabeth_II_in_March_2015.jpg","image","jpeg","1156030104","58651","image_200_399","user","en.wikipedia","2022-09-09T05:00:00Z","2022","9","9","5"
>
> "/wikipedia/commons/2/28/Aaj_tak_logo.png","image","png","57716837856","469335","original","user","external","2022-09-09T02:00:00Z","2022","9","9","2"
>
> "/wikipedia/commons/c/ca/Wiki_Loves_Monuments_Logo_notext.svg

[Analytics] Updated data in the wiki comparison tool

2022-02-07 Thread Neil Shah-Quinn
Hi everyone!

The wiki comparison tool [1] is a quick reference for the teeming ecosystem
of Wikimedia wikis maintained by the Product Analytics team [2] at the
Wikimedia Foundation. The tool has just been updated with more recent data
(covering Jan–Dec 2021), as well as bugfixes, documentation improvements,
and a new monthly pageviews field.

If you have questions, be sure to consult the documentation in the
"introduction", "change log", and "metric definitions" tabs. If you don't
find an answer or have feedback, please get in touch! You can reach us at
product-analyt...@wikimedia.org.

[1]
https://docs.google.com/spreadsheets/d/1a-UBqsYtJl6gpauJyanx0nyxuPqRvhzJRN817XpkuS8/
[2] https://www.mediawiki.org/wiki/Product_Analytics

-- 
Neil Shah-Quinn
senior data scientist, Product Analytics
<https://www.mediawiki.org/wiki/Product_Analytics>
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Analytics mailing list -- analytics@lists.wikimedia.org
To unsubscribe send an email to analytics-le...@lists.wikimedia.org


[Analytics] Help improve data documentation during the Wikimania hackathon!

2021-08-12 Thread Neil Shah-Quinn
Hey all!

Tomorrow, during the Wikimania hackathon, some of us will be hanging out
and working on improving our documentation on public data, dashboards, and
research support. Please join us! Wikimania registration is not required.

WHAT: Clean up, expand, and reorganize Meta-Wiki's documentation on
research, data, and dashboards
WHEN: During the Wikimania Hackathon, Friday, 13 August 05:00 UTC to
Saturday, 14 August 05:00 UTC
WHERE:
https://meet.jit.si/moderated/3741d369509c72904f5247702a8c14a9d2d0b893ea3e3177b48eb50e7a7c8148
(feel free to join without audio and video if you just want to text chat)

More information: https://phabricator.wikimedia.org/T288680

-- 
Neil Shah-Quinn
senior data scientist, Product Analytics
<https://www.mediawiki.org/wiki/Product_Analytics>
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Analytics mailing list -- analytics@lists.wikimedia.org
To unsubscribe send an email to analytics-le...@lists.wikimedia.org


Re: [Analytics] Wiki comparison 2020 data is available

2021-02-24 Thread Neil Shah-Quinn
On Wed, 24 Feb 2021 at 20:04, Samat  wrote:

> 1) I don't understand "overall size rank". Based on the definitions, it is
> the "count of unique devices which visited the wiki during that month".
> What does this have to do with the size of the wiki?
>

Thank you very much for noticing this! It looks like we accidentally mixed
up some of the definitions. I've restored the correct definition of overall
size rank, which is "a composite ranking of wikis by monthly active editors
and monthly unique devices (produced by taking the geometric mean of those
two values)". I checked all of the other definitions and they should be
correct.
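
In case it helps, the score behind that ranking is just the geometric mean, so
a minimal sketch of the computation looks like this (the wikis and numbers
below are made up):

import math

# Overall size score: geometric mean of monthly active editors and monthly unique devices.
def size_score(monthly_active_editors, monthly_unique_devices):
    return math.sqrt(monthly_active_editors * monthly_unique_devices)

# Rank three made-up wikis by that score; rank 1 is the largest.
wikis = {
    "aa.wikipedia": (40, 12_000),
    "bb.wikipedia": (900, 2_500_000),
    "cc.wiktionary": (15, 90_000),
}
ranked = sorted(wikis, key=lambda w: size_score(*wikis[w]), reverse=True)
overall_size_rank = {wiki: i + 1 for i, wiki in enumerate(ranked)}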


> 2) I'd also like to see the *size of the database* (main namespace/content pages),
> which together with the number of content pages would give an impression of
> the mean size of the articles (I know this is not perfect, but better than
> nothing).
> 3) Many more wishes about additional metrics :D
>

Size of the database is a good suggestion, but sadly my team doesn't have
time to add any more metrics right now. If you'd like to note down your
suggestion for the future, we have been keeping a list at
mediawiki.org/wiki/Product_Analytics/Wiki_comparison_suggestions
<https://www.mediawiki.org/wiki/Product_Analytics/Wiki_comparison_suggestions>
.

(We'd also be thrilled to help others expand the dataset directly, but this
is likely to be difficult since our code
<https://github.com/wikimedia-research/wiki-segmentation/tree/master/data-collection>
currently relies on production data access
<https://wikitech.wikimedia.org/wiki/Analytics/Data_access>, which is only
available to staff of the Wikimedia Foundation or Wikimedia
Deutschland or official
research collaborators
<https://www.mediawiki.org/wiki/Wikimedia_Research/Formal_collaborations>.)

-- 
Neil Shah-Quinn
senior data scientist, Product Analytics
<https://www.mediawiki.org/wiki/Product_Analytics>
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] Seeking Information Regarding Pageview Traffic

2020-12-21 Thread Neil Shah-Quinn
Unfortunately, I can't tell you anything more than what you already know! I
think that huge, temporary spikes in edits or pageviews that don't match
expected patterns of human use (like the death of a celebrity or a big
editing campaign) are most likely caused by bots. With editing spikes, I
can usually confirm this belief by examining the edits. With pageview
spikes, it's much harder. If the spike was in the last 90 days, I could
investigate more by looking at the confidential raw traffic data
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/Webrequest>,
but after 90 days, that data is deleted to protect user privacy.
The case you mentioned fits all my criteria. First, it is a huge, temporary
spike. Second, it doesn't match expected patterns of human use: there is no
matching spike in mobile pageviews and the pages involved are not pages
humans would want to read. So, it's for these reasons only that I am
confident that it was caused by bots.

Now, *why* would someone use a bot to access millions of Bangla Wikipedia
articles for a single month? I have no idea. It could just be a programmer
somewhere doing an experiment. Your guess is as good as mine 
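
If you want to poke at the public numbers yourself, here is a rough sketch of
pulling the daily counts for one of those pages from the Pageview API (the
page title below is an English placeholder; substitute the actual Bangla
category title):

import requests
from urllib.parse import quote

# Per-article pageviews from the public Wikimedia REST API.
# agent can be "user", "spider", "automated", or "all-agents".
title = quote("Category:Stubs", safe="")  # placeholder; use the real Bangla title
url = (
    "https://wikimedia.org/api/rest_v1/metrics/pageviews/per-article/"
    f"bn.wikipedia.org/desktop/user/{title}/daily/20180101/20180131"
)
resp = requests.get(url, headers={"User-Agent": "pageview-spike-check (you@example.org)"})
for item in resp.json().get("items", []):
    print(item["timestamp"], item["views"])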

On Fri, 18 Dec 2020 at 21:52, Ankan Ghosh Dastider <
ankanghoshdasti...@gmail.com> wrote:

> Hi Neil,
>
> Thank you very much for responding so fast.
>
> That could be the answer! Can you please share any definite (or
> relative) information regarding the error at that time, if possible? Can
> you give me any idea on why the bot view increases so much on a certain
> year (and on some certain dates)? If possible, any example will be really
> helpful.
>
>
> Ankan
>
> On Fri, Dec 18, 2020 at 10:01 PM Neil Shah-Quinn 
> wrote:
>
>> That's a good question! I think the most likely explanation is that a bot
>> automatically viewed those pages. I see that you have already removed
>> "spider" and "automated" traffic in your Wikistats graphs, but those
>> classifications are not perfect. Before March 2020, they only detected bots
>> that explicitly marked themselves as bots. Now, our methods are more
>> sophisticated
>> <https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection>,
>> but I am sure they still miss some things.
>>
>> On Fri, 18 Dec 2020 at 18:48, Ankan Ghosh Dastider <
>> ankanghoshdasti...@gmail.com> wrote:
>>
>>> Hello everyone,
>>>
>>> I am Ankan, a Wikimedian from Bangladesh. Recently, I was searching for
>>> the Wikimedia stats website for research purposes. I got a bit curious
>>> regarding the Bengali Wikipedia total page view section
>>> <https://stats.wikimedia.org/#/bn.wikipedia.org/reading/total-page-views/normal%7Cbar%7Call%7Caccess~desktop*mobile-app*mobile-web+(agent)~user%7Cmonthly>,
>>> as the traffic didn't match the normal flow in January 2018 and faced a
>>> sudden surge of desktop access by users. It is unprecedented and the
>>> highest to date. If you check the normal rate of desktop access, you will
>>> see that it is almost 450% of the second highest.
>>>
>>> The pageview result suggests that the top-visited pages are
>>> category-related and date-related pages (the highest visited one is
>>> 'Category:Stubs', see here
>>> <https://pageviews.toolforge.org/?project=bn.wikipedia.org=desktop=user=0=2018-01-01=2018-01-31=%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80:%E0%A6%85%E0%A6%B8%E0%A6%AE%E0%A7%8D%E0%A6%AA%E0%A7%82%E0%A6%B0%E0%A7%8D%E0%A6%A3>)
>>> which is quite enigmatic as these pages are hardly viewed by the general
>>> readers. The result of certain dates in January 2018 is completely
>>> exceptional.
>>>
>>> Note that, I have checked some other languages and the rate is normal
>>> there.
>>>
>>> I am seeking your assistance to analyze the probable reason behind this
>>> surge. Thanks in advance!
>>>
>>>
>>> Best regards,
>>> Ankan
>>>
>>> --
>>> Ankan Ghosh Dastider (he/him)
>>> User:ANKAN <https://meta.wikimedia.org/wiki/User:ANKAN> || All Wikimedia
>>> Foundation <https://meta.wikimedia.org/wiki/Wikimedia_Foundation>'s
>>> public Wiki
>>> Executive Member || Wikimedia Bangladesh <http://wikimedia.org.bd/>
>>> Twitter <https://twitter.com/Iagdastider>  |  LinkedIn
>>> <https://www.linkedin.com/in/ankan-ghosh-dastider/>  |  ResearchGate
>>> <https://www.researchgate.net/profile/Ankan_Ghosh_Dastider&g

Re: [Analytics] Seeking Information Regarding Pageview Traffic

2020-12-18 Thread Neil Shah-Quinn
That's a good question! I think the most likely explanation is that a bot
automatically viewed those pages. I see that you have already removed
"spider" and "automated" traffic in your Wikistats graphs, but those
classifications are not perfect. Before March 2020, they only detected bots
that explicitly marked themselves as bots. Now, our methods are more
sophisticated
<https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection>,
but I am sure they still miss some things.

On Fri, 18 Dec 2020 at 18:48, Ankan Ghosh Dastider <
ankanghoshdasti...@gmail.com> wrote:

> Hello everyone,
>
> I am Ankan, a Wikimedian from Bangladesh. Recently, I was searching for
> the Wikimedia stats website for research purposes. I got a bit curious
> regarding the Bengali Wikipedia total page view section
> <https://stats.wikimedia.org/#/bn.wikipedia.org/reading/total-page-views/normal%7Cbar%7Call%7Caccess~desktop*mobile-app*mobile-web+(agent)~user%7Cmonthly>,
> as the traffic didn't match the normal flow in January 2018 and faced a
> sudden surge of desktop access by users. It is unprecedented and the
> highest to date. If you check the normal rate of desktop access, you will
> see that it is almost 450% of the second highest.
>
> The pageview result suggests that the top-visited pages are
> category-related and date-related pages (the highest visited one is
> 'Category:Stubs', see here
> <https://pageviews.toolforge.org/?project=bn.wikipedia.org=desktop=user=0=2018-01-01=2018-01-31=%E0%A6%AC%E0%A6%BF%E0%A6%B7%E0%A6%AF%E0%A6%BC%E0%A6%B6%E0%A7%8D%E0%A6%B0%E0%A7%87%E0%A6%A3%E0%A7%80:%E0%A6%85%E0%A6%B8%E0%A6%AE%E0%A7%8D%E0%A6%AA%E0%A7%82%E0%A6%B0%E0%A7%8D%E0%A6%A3>)
> which is quite enigmatic as these pages are hardly viewed by the general
> readers. The result of certain dates in January 2018 is completely
> exceptional.
>
> Note that, I have checked some other languages and the rate is normal
> there.
>
> I am seeking your assistance to analyze the probable reason behind this
> surge. Thanks in advance!
>
>
> Best regards,
> Ankan
>
> --
> Ankan Ghosh Dastider (he/him)
> User:ANKAN <https://meta.wikimedia.org/wiki/User:ANKAN> || All Wikimedia
> Foundation <https://meta.wikimedia.org/wiki/Wikimedia_Foundation>'s
> public Wiki
> Executive Member || Wikimedia Bangladesh <http://wikimedia.org.bd/>
> Twitter <https://twitter.com/Iagdastider>  |  LinkedIn
> <https://www.linkedin.com/in/ankan-ghosh-dastider/>  |  ResearchGate
> <https://www.researchgate.net/profile/Ankan_Ghosh_Dastider>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>


-- 
Neil Shah-Quinn
senior data scientist, Product Analytics
<https://www.mediawiki.org/wiki/Product_Analytics>
Wikimedia Foundation <https://wikimediafoundation.org/>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] "automated" marker added to pageview data

2020-05-13 Thread Neil Shah-Quinn
Nuria,

Thank you for this update! I'm very excited about this new system.

I did notice that there's not much explanation of the particular rules or
strategies that are used to identify automated traffic, or a link to the
implementing code. I can imagine this might be intentional, to make it
harder for the spammers and vandals to evade the system. If so, it would be
helpful to update the page to say that explicitly and explain how people
can request more details if they have a legitimate need for them.
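
For anyone who wants to see the effect on their own wiki, here is a minimal
sketch that compares "user" and "automated" totals through the public Pageview
API (the project and month below are just examples):

import requests

# Aggregate pageviews by agent type from the public Wikimedia REST API.
def monthly_views(project, agent, start="2020040100", end="2020043000"):
    url = (
        "https://wikimedia.org/api/rest_v1/metrics/pageviews/aggregate/"
        f"{project}/all-access/{agent}/monthly/{start}/{end}"
    )
    resp = requests.get(url, headers={"User-Agent": "agent-split-check (you@example.org)"})
    return sum(item["views"] for item in resp.json()["items"])

user = monthly_views("en.wikipedia.org", "user")
automated = monthly_views("en.wikipedia.org", "automated")
print(f"automated share: {automated / (user + automated):.1%}")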

On Tue, 5 May 2020 at 02:40, Nuria Ruiz  wrote:

> Hello:
>
> We have added the 'automated' marker to Wikimedia's pageview data. Up to
> now pageview agents were classified as 'spider' (self reported bots like
> 'google bot' or 'bing bot') and 'user'.
>
> We have known for a while that some requests classified as 'user' were, in
> fact, coming from automated agents not disclosed as such. This was a well
> known fact for our community as for a couple years now they have been
> applying filtering rules for any "Top X" list compiled [1]. We have
> incorporated some of these filters (and others) to our automated traffic
> detection and, as of this week, traffic that meets the filtering
> criteria is now automatically excluded from being counted towards "top"
> lists reported by the pageview API.
>
> The effect of removing pageviews marked as 'automated' from the overall
> user traffic is about a 5.6% reduction of pageviews labeled as "user" [2]
> in the course of  a month. Not all projects are affected equally when it
> comes to reduction of "user pageviews". The biggest effect is on English
> Wikipedia (8-10%). However, projects like the Japanese Wikipedia are mildly
> affected (< 1%).
>
> If you are curious as to what problems this type of traffic causes in the
> data, this ticket for Hungarian Wikipedia is a good example of issues
> inflicted by what we call "bot vandalism/bot spam":
> https://phabricator.wikimedia.org/T237282
>
> Given the delicate nature of this data we have worked for many months now
> on vetting the algorithms we are using. We will appreciate reports via phab
> ticket for any issues you might find.
>
> Thanks,
>
> Nuria
>
> [1] https://en.wikipedia.org/wiki/Wikipedia:2018_Top_50_Report#Exclusions
> [2]
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Traffic/BotDetection#Global_Impact_-_All_wikimedia_projects
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] wmfdata users: update to version 1.0

2020-03-13 Thread Neil Shah-Quinn
If you use the wmfdata <https://github.com/neilpquinn/wmfdata> Python
package, your version probably no longer works. You can fix this by
updating to version 1.0 using the following terminal command:
pip install --upgrade git+https://github.com/neilpquinn/wmfdata.git@release


*More details*
wmfdata <https://github.com/neilpquinn/wmfdata> is a Python package that
streamlines data analysis on SWAP <https://wikitech.wikimedia.org/wiki/SWAP>
(Wikimedia's JupyterHub service for private data). We have just released
version 1.0. This is our first properly-versioned release and is filled
with many improvements,
<https://github.com/neilpquinn/wmfdata/blob/master/CHANGELOG.md#100-13-march-2020>
including much better settings for Spark sessions and the ability to run
SQL using Hive's command line interface.
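
If you haven't used it before, a typical call looks roughly like this (a
sketch; the table and columns are Data Lake examples rather than part of the
wmfdata interface):

import wmfdata

# Run SQL against the Data Lake; by default the result comes back as a pandas DataFrame.
monthly_views = wmfdata.hive.run("""
    SELECT year, month, SUM(view_count) AS views
    FROM wmf.pageview_hourly
    WHERE year = 2020 AND agent_type = 'user'
    GROUP BY year, month
""")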

The package automatically checks for updates when imported and notifies the
user if any are available; however, due to changes we've made setting up
our release process, the check in old versions will now encounter an error
which will prevent the package from being imported.

The check now raises a warning rather than an error if something goes
wrong, so future changes will not cause such disruption. Bricking
<https://www.techdirt.com/articles/20200220/13121043955/next-risk-buying-iot-product-is-having-it-bricked-patent-dispute.shtml>
old
<https://www.theverge.com/2016/4/4/11362928/google-nest-revolv-shutdown-smart-home-products>
versions
<https://www.ksby.com/news/national-news/spectrum-ending-home-security-leaving-customers-scrambling>
is definitely not good practice!
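
The pattern is nothing fancy; roughly this (a sketch, not the actual wmfdata
code):

import warnings

def check_for_updates(current_version, fetch_latest_version):
    """Sketch of a non-fatal update check.

    Any failure while fetching or comparing versions becomes a warning, so a
    broken or unreachable release channel can never block importing the package.
    """
    try:
        latest = fetch_latest_version()
        if latest != current_version:
            warnings.warn(f"You are using version {current_version}, but {latest} is available.")
    except Exception as exc:
        warnings.warn(f"Could not check for updates: {exc}")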

If you do data analysis on SWAP, but haven't tried wmfdata, please check it
out <https://github.com/neilpquinn/wmfdata>! It has lots of useful features
and we have even more planned
<https://phabricator.wikimedia.org/maniphest/?project=PHID-PROJ-ajtbhv3nvefa4poqypoe=open()=none=newest#R>.
We also have a similarly useful R package
<https://phabricator.wikimedia.org/diffusion/1821/>.

If you have any questions or comments, please do email us at
product-analyt...@wikimedia.org.

On behalf of the Product Analytics team,
Neil Shah-Quinn
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-19 Thread Neil Shah-Quinn
Another update: I'm continuing to encounter these Spark errors and have
trouble recovering from them, even when I use proper settings. I've filed
T245713 <https://phabricator.wikimedia.org/T245713> to discuss this
further. The specific errors and behavior I'm seeing (for example, whether
explicitly calling session.stop allows a new functioning session to be
created) are not consistent, so I'm still trying to make sense of it.

I would greatly appreciate any input or help, even if it's identifying
places where my description doesn't make sense.
<https://phabricator.wikimedia.org/T245713>
<https://phabricator.wikimedia.org/T245713>

On Wed, 19 Feb 2020 at 13:35, Neil Shah-Quinn 
wrote:

> Bump!
>
> Analytics team, I'm eager to have input from y'all about the best Spark
> settings to use.
>
> On Fri, 14 Feb 2020 at 18:30, Neil Shah-Quinn 
> wrote:
>
>> I ran into this problem again, and I found that neither session.stop or
>> newSession got rid of the error. So it's still not clear how to recover
>> from a crashed(?) Spark session.
>>
>> On the other hand, I did figure out why my sessions were crashing in the
>> first place, so hopefully recovering from that will be a rare need. The
>> reason is that wmfdata doesn't modify
>> <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/spark.py#L60>
>> the default Spark settings when it starts a new session, which was (for example)
>> causing it to start executors with only ~400 MiB of memory each.
>>
>> I'm definitely going to change that, but it's not completely clear what
>> the recommended settings for our cluster are. I cataloged the different
>> recommendations at https://phabricator.wikimedia.org/T245097, and it
>> would be very helpful if one of y'all could give some clear recommendations
>> about what the settings should be for local SWAP, YARN, and "large" YARN
>> jobs. For example, is it important to increase spark.sql.shuffle.partitions
>> for YARN jobs? Is it reasonable to use 8 GiB of driver memory for a local
>> job when the SWAP servers only have 64 GiB total?
>>
>> Thank you!
>>
>>
>>
>>
>> On Fri, 7 Feb 2020 at 06:53, Andrew Otto  wrote:
>>
>>> Hm, interesting!  I don't think many of us have used
>>> SparkSession.builder.getOrCreate repeatedly in the same process.  What
>>> happens if you manually stop the spark session first, (session.stop()
>>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.stop>?)
>>> or maybe try to explicitly create a new session via newSession()
>>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.newSession>
>>> ?
>>>
>>> On Thu, Feb 6, 2020 at 7:31 PM Neil Shah-Quinn 
>>> wrote:
>>>
>>>> Hi Luca!
>>>>
>>>> Those were separate Yarn jobs I started later. When I got this error, I
>>>> found that the Yarn job corresponding to the SparkContext was marked as
>>>> "successful", but I still couldn't get SparkSession.builder.getOrCreate to
>>>> open a new one.
>>>>
>>>> Any idea what might have caused that or how I could recover without
>>>> restarting the notebook, which could mean losing a lot of in-progress work?
>>>> I had already restarted that kernel so I don't know if I'll encounter this
>>>> problem again. If I do, I'll file a task.
>>>>
>>>> On Wed, 5 Feb 2020 at 23:24, Luca Toscano 
>>>> wrote:
>>>>
>>>>> Hey Neil,
>>>>>
>>>>> there were two Yarn jobs running related to your notebooks, I just
>>>>> killed them, let's see if it solves the problem (you might need to restart
>>>>> again your notebook). If not, let's open a task and investigate :)
>>>>>
>>>>> Luca
>>>>>
>>>>> Il giorno gio 6 feb 2020 alle ore 02:08 Neil Shah-Quinn <
>>>>> nshahqu...@wikimedia.org> ha scritto:
>>>>>
>>>>>> Whoa—I just got the same stopped SparkContext error on the query even
>>>>>> after restarting the notebook, without an intermediate Java heap space
>>>>>> error. That seems very strange to me.
>>>>>>
>>>>>> On Wed, 5 Feb 2020 at 16:09, Neil Shah-Quinn <
>>>>>> nshahqu...@wikimedia.org> wrote:
>>>>>>
>>>>>>> Hey there!
>>>>>>>
>>>>>>> I was runnin

Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-19 Thread Neil Shah-Quinn
Bump!

Analytics team, I'm eager to have input from y'all about the best Spark
settings to use.

On Fri, 14 Feb 2020 at 18:30, Neil Shah-Quinn 
wrote:

> I ran into this problem again, and I found that neither session.stop or
> newSession got rid of the error. So it's still not clear how to recover
> from a crashed(?) Spark session.
>
> On the other hand, I did figure out why my sessions were crashing in the
> first place, so hopefully recovering from that will be a rare need. The
> reason is that wmfdata doesn't modify
> <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/spark.py#L60>
> the default Spark settings when it starts a new session, which was (for example)
> causing it to start executors with only ~400 MiB of memory each.
>
> I'm definitely going to change that, but it's not completely clear what
> the recommended settings for our cluster are. I cataloged the different
> recommendations at https://phabricator.wikimedia.org/T245097, and it
> would be very helpful if one of y'all could give some clear recommendations
> about what the settings should be for local SWAP, YARN, and "large" YARN
> jobs. For example, is it important to increase spark.sql.shuffle.partitions
> for YARN jobs? Is it reasonable to use 8 GiB of driver memory for a local
> job when the SWAP servers only have 64 GiB total?
>
> Thank you!
>
>
>
>
> On Fri, 7 Feb 2020 at 06:53, Andrew Otto  wrote:
>
>> Hm, interesting!  I don't think many of us have used
>> SparkSession.builder.getOrCreate repeatedly in the same process.  What
>> happens if you manually stop the spark session first, (session.stop()
>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.stop>?)
>> or maybe try to explicitly create a new session via newSession()
>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.newSession>
>> ?
>>
>> On Thu, Feb 6, 2020 at 7:31 PM Neil Shah-Quinn 
>> wrote:
>>
>>> Hi Luca!
>>>
>>> Those were separate Yarn jobs I started later. When I got this error, I
>>> found that the Yarn job corresponding to the SparkContext was marked as
>>> "successful", but I still couldn't get SparkSession.builder.getOrCreate to
>>> open a new one.
>>>
>>> Any idea what might have caused that or how I could recover without
>>> restarting the notebook, which could mean losing a lot of in-progress work?
>>> I had already restarted that kernel so I don't know if I'll encounter this
>>> problem again. If I do, I'll file a task.
>>>
>>> On Wed, 5 Feb 2020 at 23:24, Luca Toscano 
>>> wrote:
>>>
>>>> Hey Neil,
>>>>
>>>> there were two Yarn jobs running related to your notebooks, I just
>>>> killed them, let's see if it solves the problem (you might need to restart
>>>> again your notebook). If not, let's open a task and investigate :)
>>>>
>>>> Luca
>>>>
>>>> Il giorno gio 6 feb 2020 alle ore 02:08 Neil Shah-Quinn <
>>>> nshahqu...@wikimedia.org> ha scritto:
>>>>
>>>>> Whoa—I just got the same stopped SparkContext error on the query even
>>>>> after restarting the notebook, without an intermediate Java heap space
>>>>> error. That seems very strange to me.
>>>>>
>>>>> On Wed, 5 Feb 2020 at 16:09, Neil Shah-Quinn 
>>>>> wrote:
>>>>>
>>>>>> Hey there!
>>>>>>
>>>>>> I was running SQL queries via PySpark (using the wmfdata package
>>>>>> <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>)
>>>>>> on SWAP when one of my queries failed with "java.lang.OutofMemoryError:
>>>>>> Java heap space".
>>>>>>
>>>>>> After that, when I tried to call the spark.sql function again (via
>>>>>> wmfdata.hive.run), it failed with "java.lang.IllegalStateException: 
>>>>>> Cannot
>>>>>> call methods on a stopped SparkContext."
>>>>>>
>>>>>> When I tried to create a new Spark context using
>>>>>> SparkSession.builder.getOrCreate (whether using wmfdata.spark.get_session
>>>>>> or directly), it returned a SparkContent object properly, but calling the
>>>>>> object's sql function still gave the "stopped SparkContext error".
>>>>>>
>>>>>> Any id

Re: [Analytics] [Research-Internal] Tutorials on disk space usage for notebook/stat boxes

2020-02-18 Thread Neil Shah-Quinn
Thank you very much, Luca!

To make this nice documentation easier to discover, I moved it to
Analytics/Systems/Clients along with
the other information on the clients from Analytics/Data access.

On Tue, 18 Feb 2020 at 17:11, Isaac Johnson  wrote:

> Thanks for pulling together these directions Luca! I did a little clean-up
> and will try to remember to do so more routinely.
>
> Adding to what Diego said, I also started using stat1007 because it has
> the most access to resources (dumps, Hadoop, MariaDB), and then my virtual
> environments, config files, etc. are there and so I tend to do all of my
> work on stat1007 even when the other stat machines might work for other
> projects. Putting the GPU on stat1005 helped me diversify a little but I'm
> very excited to hear that the stat machines will be more standardized so it
> matters less which machine I choose. While I have no desire to be spread
> out across the machines (a few projects on stat1004, a few on stat1005,
> etc.) because then I'll certainly lose track of where different projects
> are, I would be open to trying to choose another host as my "main"
> workspace.
>
> Best,
> Isaac
>
> On Tue, Feb 18, 2020 at 10:53 AM Andrew Otto  wrote:
>
>> I added a 'GPU?' column too. :)  THANKS LUCA!
>>
>> On Tue, Feb 18, 2020 at 11:51 AM Luca Toscano 
>> wrote:
>>
>>> Hey Diego,
>>>
>>> added a section at the end of the page with the info requested, let me
>>> know if anything is missing :)
>>>
>>> Luca
>>>
>>> Il giorno mar 18 feb 2020 alle ore 17:37 Diego Saez-Trumper <
>>> di...@wikimedia.org> ha scritto:
>>>
 Thanks for this Luca.

 I tend to use stat1007 because I know that machine has a lot of ram/cpu
 and HDFS access. From other statsX I'm not sure which of them have what
 resources (I know at least one of them doesn't have HDFS access). Is there
 a table where I can look at a summary of resources per machine?

 Thanks again.

 On Tue, Feb 18, 2020 at 8:53 AM Luca Toscano 
 wrote:

> Hi everybody!
>
> I created the following doc:
> https://wikitech.wikimedia.org/wiki/Analytics/Tutorials/Analytics_Client_Nodes
>
> It contains two FAQ:
> - How do I ensure that there is enough space on disk before storing
> big datasets/files ?
> - How do I check the space used by my files/data on stat/notebook
> hosts ?
>
> Please read them and let me know if anything is not clear or missing.
> We have plenty of space on stat100X hosts, but we tend to cluster on 
> single
> machines like stat1007 for some reason, ending up in fighting for 
> resources.
>
> On a related note, we are going to work on unifying stat/notebook
> puppet configs in https://phabricator.wikimedia.org/T243934, so
> eventually all Analytics clients will be exactly the same.
>
> Thanks!
>
> Luca (on behalf of the Analytics team)
>
>
> ___
> Research-Internal mailing list
> research-inter...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/research-internal
>
 ___
 Analytics mailing list
 Analytics@lists.wikimedia.org
 https://lists.wikimedia.org/mailman/listinfo/analytics

>>> ___
>>> Research-Internal mailing list
>>> research-inter...@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/research-internal
>>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
>
>
> --
> Isaac Johnson (he/him/his) -- Research Scientist -- Wikimedia Foundation
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-14 Thread Neil Shah-Quinn
I ran into this problem again, and I found that neither session.stop or
newSession got rid of the error. So it's still not clear how to recover
from a crashed(?) Spark session.
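
For reference, the recovery attempts in question look roughly like this in a
notebook; neither gave me a usable session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
spark.stop()                                # explicitly stop the (already dead) session
spark = SparkSession.builder.getOrCreate()  # returns a session object...
spark.sql("SELECT 1").show()                # ...but this still raised the stopped-SparkContext error for me

new_session = spark.newSession()            # new session sharing the same SparkContext
new_session.sql("SELECT 1").show()          # same error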

On the other hand, I did figure out why my sessions were crashing in the
first place, so hopefully recovering from that will be a rare need. The
reason is that wmfdata doesn't modify
<https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/spark.py#L60>
the default Spark settings when it starts a new session, which was (for example)
causing it to start executors with only ~400 MiB of memory each.

I'm definitely going to change that, but it's not completely clear what the
recommended settings for our cluster are. I cataloged the different
recommendations at https://phabricator.wikimedia.org/T245097, and it would be
very helpful if one of y'all could give some clear recommendations about
what the settings should be for local SWAP, YARN, and "large" YARN jobs.
For example, is it important to increase spark.sql.shuffle.partitions for
YARN jobs? Is it reasonable to use 8 GiB of driver memory for a local job
when the SWAP servers only have 64 GiB total?
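
For context, by "settings" I mean explicit configuration along these lines
(the values below are only illustrations pulled together from the different
recommendations, not a conclusion):

from pyspark.sql import SparkSession

# An explicitly sized YARN session instead of relying on the defaults.
spark = (
    SparkSession.builder
    .master("yarn")
    .appName("example-analysis")
    .config("spark.driver.memory", "2g")
    .config("spark.executor.memory", "4g")
    .config("spark.executor.cores", "2")
    .config("spark.dynamicAllocation.maxExecutors", "64")
    .config("spark.sql.shuffle.partitions", "256")
    .getOrCreate()
)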

Thank you!




On Fri, 7 Feb 2020 at 06:53, Andrew Otto  wrote:

> Hm, interesting!  I don't think many of us have used 
> SparkSession.builder.getOrCreate
> repeatedly in the same process.  What happens if you manually stop the
> spark session first, (session.stop()
> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.stop>?)
> or maybe try to explicitly create a new session via newSession()
> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.newSession>
> ?
>
> On Thu, Feb 6, 2020 at 7:31 PM Neil Shah-Quinn 
> wrote:
>
>> Hi Luca!
>>
>> Those were separate Yarn jobs I started later. When I got this error, I
>> found that the Yarn job corresponding to the SparkContext was marked as
>> "successful", but I still couldn't get SparkSession.builder.getOrCreate to
>> open a new one.
>>
>> Any idea what might have caused that or how I could recover without
>> restarting the notebook, which could mean losing a lot of in-progress work?
>> I had already restarted that kernel so I don't know if I'll encounter this
>> problem again. If I do, I'll file a task.
>>
>> On Wed, 5 Feb 2020 at 23:24, Luca Toscano  wrote:
>>
>>> Hey Neil,
>>>
>>> there were two Yarn jobs running related to your notebooks, I just
>>> killed them, let's see if it solves the problem (you might need to restart
>>> again your notebook). If not, let's open a task and investigate :)
>>>
>>> Luca
>>>
>>> Il giorno gio 6 feb 2020 alle ore 02:08 Neil Shah-Quinn <
>>> nshahqu...@wikimedia.org> ha scritto:
>>>
>>>> Whoa—I just got the same stopped SparkContext error on the query even
>>>> after restarting the notebook, without an intermediate Java heap space
>>>> error. That seems very strange to me.
>>>>
>>>> On Wed, 5 Feb 2020 at 16:09, Neil Shah-Quinn 
>>>> wrote:
>>>>
>>>>> Hey there!
>>>>>
>>>>> I was running SQL queries via PySpark (using the wmfdata package
>>>>> <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>)
>>>>> on SWAP when one of my queries failed with "java.lang.OutofMemoryError:
>>>>> Java heap space".
>>>>>
>>>>> After that, when I tried to call the spark.sql function again (via
>>>>> wmfdata.hive.run), it failed with "java.lang.IllegalStateException: Cannot
>>>>> call methods on a stopped SparkContext."
>>>>>
>>>>> When I tried to create a new Spark context using
>>>>> SparkSession.builder.getOrCreate (whether using wmfdata.spark.get_session
>>>>> or directly), it returned a SparkContent object properly, but calling the
>>>>> object's sql function still gave the "stopped SparkContext error".
>>>>>
>>>>> Any idea what's going on? I assume restarting the notebook kernel
>>>>> would take care of the problem, but it seems like there has to be a better
>>>>> way to recover.
>>>>>
>>>>> Thank you!
>>>>>
>>>>>
>>>>> ___
>>>> Analytics mailing list
>>>> Analytics@lists.wikimedia.org
>>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>>
>>> ___
>>> Analytics mailing list
>>> Analytics@lists.wikimedia.org
>>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>>
>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] [Wiki-research-l] Announcement - Mediawiki History Dumps

2020-02-10 Thread Neil Shah-Quinn
I want to echo what Nate said. We've been using this for more than a year
within the Wikimedia Foundation, and it has made analyses of editing
behavior much, much easier and faster, not to mention a lot less annoying.

This is the product of years of expert work by the Analytics team, and they
deserve plenty of congratulations for it 
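
If you haven't tried the dumps yet, loading a month of one wiki locally is as
simple as something like this (the file name is only an example, and the
ordered field list lives in the readme linked below):

import pandas as pd

# The dumps are bzip2-compressed TSVs; check the readme for the exact file
# naming and the field list before relying on column positions.
events = pd.read_csv(
    "2020-01.simplewiki.all-time.tsv.bz2",  # example name; download from the dumps site
    sep="\t",
    header=None,
    compression="bz2",
    quoting=3,  # csv.QUOTE_NONE: treat quote characters as literal text
    low_memory=False,
)
print(len(events), "events")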

On Mon, 10 Feb 2020 at 10:42, Nate E TeBlunthuis  wrote:

> Thank you so much Joal! I've been happily using this data for some time
> and I'm optimistic that it can make doing thorough analyses of Wikimedia
> projects much more accessible to the community, students, and researchers.
>
> -- Nate
> --
> *From:* Wiki-research-l  on
> behalf of Joseph Allemandou 
> *Sent:* Monday, February 10, 2020 8:27 AM
> *To:* A mailing list for the Analytics Team at WMF and everybody who has
> an interest in Wikipedia and analytics. ;
> Research into Wikimedia content and communities <
> wiki-researc...@lists.wikimedia.org>; Product Analytics <
> product-analyt...@wikimedia.org>
> *Subject:* [Wiki-research-l] Announcement - Mediawiki History Dumps
>
> Hi Analytics People,
>
> The Wikimedia Analytics Team is pleased to announce the release of the most
> complete dataset we have to date to analyze content and contributors
> metadata: Mediawiki History [1] [2].
>
> Data is in TSV format, released monthly around the 3rd of the month
> usually, and every new release contains the full history of metadata.
>
> The dataset contains an enhanced [3] and historified [4] version of user,
> page and revision metadata and serves as the base for the Wikistats API on edits,
> users and pages [5] [6].
>
> We hope you will have as much fun playing with the data as we have building
> it, and we're eager to hear from you [7], whether for issues, ideas or
> usage of the data.
>
> Analytically yours,
>
> --
> Joseph Allemandou (joal) (he / him)
> Sr Data Engineer
> Wikimedia Foundation
>
> [1] https://dumps.wikimedia.org/other/mediawiki_history/readme.html
> [2]
>
> https://wikitech.wikimedia.org/wiki/Analytics/Data_Lake/Edits/Mediawiki_history_dumps
> [3] Many pre-computed fields are present in the dataset, from edit-counts
> by user and page to reverts and reverted information, as well as time
> between events.
> [4] As accurate as possible historical usernames and page-titles (as well
> as user-groups and blocks) are available in addition to current values, and
> are provided in a denormalized way to every event of the dataset.
> [5] https://wikitech.wikimedia.org/wiki/Analytics/AQS/Wikistats_2
> [6] https://wikimedia.org/api/rest_v1/
> [7]
>
> https://phabricator.wikimedia.org/maniphest/task/edit/?title=Mediawiki%20History%20Dumps=Analytics-Wikistats,Analytics
> ___
> Wiki-research-l mailing list
> wiki-researc...@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wiki-research-l
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-07 Thread Neil Shah-Quinn
Good suggestions, Andrew! I'll try those if I encounter this again.

Nuria, we had a discussion about the appropriate places to ask questions
about internal systems in October 2018, and the verdict (supported by you)
was that we should use this list or the public IRC channel.

If you want to revisit that decision, I'd suggest you consult that thread
first (the subject was "Where to ask questions about internal analytics
tools") because I included a detailed list of pros and cons of different
channels to start the discussion. In that list, I even mentioned that such
discussions on this channel could annoy subscribers who don't have access
to these systems 

If you still want us to use a different list, we can certainly do that. If
so, please send my team a message and update the docs I added
<https://wikitech.wikimedia.org/wiki/Analytics#Contact> so it stays clear.

On Fri, 7 Feb 2020 at 07:48, Nuria Ruiz  wrote:

> Hello,
>
> Probably this discussion is not of wide interest to this public list, I
> suggest to move it to analytics-internal?
>
> Thanks,
>
> Nuria
>
> On Fri, Feb 7, 2020 at 6:53 AM Andrew Otto  wrote:
>
>> Hm, interesting!  I don't think many of us have used
>> SparkSession.builder.getOrCreate repeatedly in the same process.  What
>> happens if you manually stop the spark session first, (session.stop()
>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.stop>?)
>> or maybe try to explicitly create a new session via newSession()
>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.newSession>
>> ?
>>
>> On Thu, Feb 6, 2020 at 7:31 PM Neil Shah-Quinn 
>> wrote:
>>
>>> Hi Luca!
>>>
>>> Those were separate Yarn jobs I started later. When I got this error, I
>>> found that the Yarn job corresponding to the SparkContext was marked as
>>> "successful", but I still couldn't get SparkSession.builder.getOrCreate to
>>> open a new one.
>>>
>>> Any idea what might have caused that or how I could recover without
>>> restarting the notebook, which could mean losing a lot of in-progress work?
>>> I had already restarted that kernel so I don't know if I'll encounter this
>>> problem again. If I do, I'll file a task.
>>>
>>> On Wed, 5 Feb 2020 at 23:24, Luca Toscano 
>>> wrote:
>>>
>>>> Hey Neil,
>>>>
>>>> there were two Yarn jobs running related to your notebooks, I just
>>>> killed them, let's see if it solves the problem (you might need to restart
>>>> again your notebook). If not, let's open a task and investigate :)
>>>>
>>>> Luca
>>>>
>>>> Il giorno gio 6 feb 2020 alle ore 02:08 Neil Shah-Quinn <
>>>> nshahqu...@wikimedia.org> ha scritto:
>>>>
>>>>> Whoa—I just got the same stopped SparkContext error on the query even
>>>>> after restarting the notebook, without an intermediate Java heap space
>>>>> error. That seems very strange to me.
>>>>>
>>>>> On Wed, 5 Feb 2020 at 16:09, Neil Shah-Quinn 
>>>>> wrote:
>>>>>
>>>>>> Hey there!
>>>>>>
>>>>>> I was running SQL queries via PySpark (using the wmfdata package
>>>>>> <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>)
>>>>>> on SWAP when one of my queries failed with "java.lang.OutofMemoryError:
>>>>>> Java heap space".
>>>>>>
>>>>>> After that, when I tried to call the spark.sql function again (via
>>>>>> wmfdata.hive.run), it failed with "java.lang.IllegalStateException: 
>>>>>> Cannot
>>>>>> call methods on a stopped SparkContext."
>>>>>>
>>>>>> When I tried to create a new Spark context using
>>>>>> SparkSession.builder.getOrCreate (whether using wmfdata.spark.get_session
>>>>>> or directly), it returned a SparkContent object properly, but calling the
>>>>>> object's sql function still gave the "stopped SparkContext error".
>>>>>>
>>>>>> Any idea what's going on? I assume restarting the notebook kernel
>>>>>> would take care of the problem, but it seems like there has to be a 
>>>>>> better
>>>>>> way to recover.
>>>>>>
>>>>>> Thank you!
>>&g

Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-06 Thread Neil Shah-Quinn
Hi Luca!

Those were separate Yarn jobs I started later. When I got this error, I
found that the Yarn job corresponding to the SparkContext was marked as
"successful", but I still couldn't get SparkSession.builder.getOrCreate to
open a new one.

Any idea what might have caused that or how I could recover without
restarting the notebook, which could mean losing a lot of in-progress work?
I had already restarted that kernel so I don't know if I'll encounter this
problem again. If I do, I'll file a task.

On Wed, 5 Feb 2020 at 23:24, Luca Toscano  wrote:

> Hey Neil,
>
> there were two Yarn jobs running related to your notebooks, I just killed
> them, let's see if it solves the problem (you might need to restart again
> your notebook). If not, let's open a task and investigate :)
>
> Luca
>
> Il giorno gio 6 feb 2020 alle ore 02:08 Neil Shah-Quinn <
> nshahqu...@wikimedia.org> ha scritto:
>
>> Whoa—I just got the same stopped SparkContext error on the query even
>> after restarting the notebook, without an intermediate Java heap space
>> error. That seems very strange to me.
>>
>> On Wed, 5 Feb 2020 at 16:09, Neil Shah-Quinn 
>> wrote:
>>
>>> Hey there!
>>>
>>> I was running SQL queries via PySpark (using the wmfdata package
>>> <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>) on
>>> SWAP when one of my queries failed with "java.lang.OutofMemoryError: Java
>>> heap space".
>>>
>>> After that, when I tried to call the spark.sql function again (via
>>> wmfdata.hive.run), it failed with "java.lang.IllegalStateException: Cannot
>>> call methods on a stopped SparkContext."
>>>
>>> When I tried to create a new Spark context using
>>> SparkSession.builder.getOrCreate (whether using wmfdata.spark.get_session
>>> or directly), it returned a SparkContent object properly, but calling the
>>> object's sql function still gave the "stopped SparkContext error".
>>>
>>> Any idea what's going on? I assume restarting the notebook kernel would
>>> take care of the problem, but it seems like there has to be a better way to
>>> recover.
>>>
>>> Thank you!
>>>
>>>
>>> ___
>> Analytics mailing list
>> Analytics@lists.wikimedia.org
>> https://lists.wikimedia.org/mailman/listinfo/analytics
>>
> ___
> Analytics mailing list
> Analytics@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/analytics
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


Re: [Analytics] SparkContext stopped and cannot be restarted

2020-02-05 Thread Neil Shah-Quinn
Whoa—I just got the same stopped SparkContext error on the query even after
restarting the notebook, without an intermediate Java heap space error.
That seems very strange to me.

On Wed, 5 Feb 2020 at 16:09, Neil Shah-Quinn 
wrote:

> Hey there!
>
> I was running SQL queries via PySpark (using the wmfdata package
> <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>) on
> SWAP when one of my queries failed with "java.lang.OutofMemoryError: Java
> heap space".
>
> After that, when I tried to call the spark.sql function again (via
> wmfdata.hive.run), it failed with "java.lang.IllegalStateException: Cannot
> call methods on a stopped SparkContext."
>
> When I tried to create a new Spark context using
> SparkSession.builder.getOrCreate (whether using wmfdata.spark.get_session
> or directly), it returned a SparkContent object properly, but calling the
> object's sql function still gave the "stopped SparkContext error".
>
> Any idea what's going on? I assume restarting the notebook kernel would
> take care of the problem, but it seems like there has to be a better way to
> recover.
>
> Thank you!
>
>
>
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics


[Analytics] SparkContext stopped and cannot be restarted

2020-02-05 Thread Neil Shah-Quinn
Hey there!

I was running SQL queries via PySpark (using the wmfdata package
<https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>) on
SWAP when one of my queries failed with "java.lang.OutofMemoryError: Java
heap space".

After that, when I tried to call the spark.sql function again (via
wmfdata.hive.run), it failed with "java.lang.IllegalStateException: Cannot
call methods on a stopped SparkContext."

When I tried to create a new Spark context using
SparkSession.builder.getOrCreate (whether using wmfdata.spark.get_session
or directly), it returned a SparkContent object properly, but calling the
object's sql function still gave the "stopped SparkContext error".

Any idea what's going on? I assume restarting the notebook kernel would
take care of the problem, but it seems like there has to be a better way to
recover.

Thank you!
___
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics