Another update: I'm continuing to encounter these Spark errors and to have trouble recovering from them, even when I use proper settings. I've filed T245713 <https://phabricator.wikimedia.org/T245713> to discuss this further. The specific errors and behavior I'm seeing (for example, whether explicitly calling session.stop allows a new, functioning session to be created) are not consistent, so I'm still trying to make sense of it.

I would greatly appreciate any input or help, even if it's just identifying places where my description doesn't make sense.
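For concreteness, the recovery pattern I've been attempting looks roughly like this (a minimal sketch; the exact sequence has varied between attempts, which may be part of why the results are inconsistent):

    from pyspark.sql import SparkSession

    # `spark` is the existing, apparently crashed session.
    spark = SparkSession.builder.getOrCreate()

    # Explicitly stop it...
    spark.stop()

    # ...then ask for a fresh one. In theory, getOrCreate() should now
    # build a new session rather than return the stopped one.
    spark = SparkSession.builder.getOrCreate()

    # Sometimes this works; other times spark.sql still fails with
    # "Cannot call methods on a stopped SparkContext".
    spark.sql("SELECT 1").show()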
On Wed, 19 Feb 2020 at 13:35, Neil Shah-Quinn <nshahqu...@wikimedia.org> wrote:

> Bump!
>
> Analytics team, I'm eager to have input from y'all about the best Spark
> settings to use.
>
> On Fri, 14 Feb 2020 at 18:30, Neil Shah-Quinn <nshahqu...@wikimedia.org> wrote:
>
>> I ran into this problem again, and I found that neither session.stop nor
>> newSession got rid of the error. So it's still not clear how to recover
>> from a crashed(?) Spark session.
>>
>> On the other hand, I did figure out why my sessions were crashing in the
>> first place, so hopefully recovering from that will be a rare need. The
>> reason is that wmfdata doesn't modify the default Spark settings
>> (<https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/spark.py#L60>)
>> when it starts a new session, which was (for example) causing it to start
>> executors with only ~400 MiB of memory each.
>>
>> I'm definitely going to change that, but it's not completely clear what
>> the recommended settings for our cluster are. I cataloged the different
>> recommendations at https://phabricator.wikimedia.org/T245097, and it
>> would be very helpful if one of y'all could give some clear
>> recommendations about what the settings should be for local SWAP jobs,
>> YARN jobs, and "large" YARN jobs. For example, is it important to
>> increase spark.sql.shuffle.partitions for YARN jobs? Is it reasonable to
>> use 8 GiB of driver memory for a local job when the SWAP servers only
>> have 64 GiB total?
>>
>> Thank you!
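>>
>> As a concrete sketch of what I mean by setting the configuration
>> explicitly (the values below are placeholders I made up for
>> illustration, not recommendations; the recommendations are exactly what
>> T245097 is asking for):
>>
>>     from pyspark.sql import SparkSession
>>
>>     spark = (
>>         SparkSession.builder
>>         .master("yarn")
>>         # Without explicit settings, executors start with the defaults,
>>         # which in my case meant only ~400 MiB of memory each.
>>         .config("spark.executor.memory", "4g")
>>         # One of the open questions: should this be raised for YARN
>>         # jobs? (Spark's default is 200.)
>>         .config("spark.sql.shuffle.partitions", "200")
>>         .getOrCreate()
>>     )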
>>
>> On Fri, 7 Feb 2020 at 06:53, Andrew Otto <o...@wikimedia.org> wrote:
>>
>>> Hm, interesting! I don't think many of us have used
>>> SparkSession.builder.getOrCreate repeatedly in the same process. What
>>> happens if you manually stop the Spark session first (session.stop()
>>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.stop>),
>>> or maybe try to explicitly create a new session via newSession()
>>> <https://spark.apache.org/docs/latest/api/python/pyspark.sql.html?highlight=sparksession#pyspark.sql.SparkSession.newSession>?
>>>
>>> On Thu, 6 Feb 2020 at 19:31, Neil Shah-Quinn <nshahqu...@wikimedia.org> wrote:
>>>
>>>> Hi Luca!
>>>>
>>>> Those were separate Yarn jobs I started later. When I got this error,
>>>> I found that the Yarn job corresponding to the SparkContext was marked
>>>> as "successful", but I still couldn't get
>>>> SparkSession.builder.getOrCreate to open a new one.
>>>>
>>>> Any idea what might have caused that, or how I could recover without
>>>> restarting the notebook, which could mean losing a lot of in-progress
>>>> work? I had already restarted that kernel, so I don't know if I'll
>>>> encounter this problem again. If I do, I'll file a task.
>>>>
>>>> On Wed, 5 Feb 2020 at 23:24, Luca Toscano <ltosc...@wikimedia.org> wrote:
>>>>
>>>>> Hey Neil,
>>>>>
>>>>> There were two Yarn jobs running related to your notebooks; I just
>>>>> killed them. Let's see if that solves the problem (you might need to
>>>>> restart your notebook again). If not, let's open a task and
>>>>> investigate :)
>>>>>
>>>>> Luca
>>>>>
>>>>> On Thu, 6 Feb 2020 at 02:08, Neil Shah-Quinn <nshahqu...@wikimedia.org> wrote:
>>>>>
>>>>>> Whoa, I just got the same stopped SparkContext error on the query
>>>>>> even after restarting the notebook, without an intermediate Java
>>>>>> heap space error. That seems very strange to me.
>>>>>>
>>>>>> On Wed, 5 Feb 2020 at 16:09, Neil Shah-Quinn <nshahqu...@wikimedia.org> wrote:
>>>>>>
>>>>>>> Hey there!
>>>>>>>
>>>>>>> I was running SQL queries via PySpark (using the wmfdata package
>>>>>>> <https://github.com/neilpquinn/wmfdata/blob/master/wmfdata/hive.py>)
>>>>>>> on SWAP when one of my queries failed with
>>>>>>> "java.lang.OutOfMemoryError: Java heap space".
>>>>>>>
>>>>>>> After that, when I tried to call the spark.sql function again (via
>>>>>>> wmfdata.hive.run), it failed with
>>>>>>> "java.lang.IllegalStateException: Cannot call methods on a stopped
>>>>>>> SparkContext."
>>>>>>>
>>>>>>> When I tried to create a new Spark session using
>>>>>>> SparkSession.builder.getOrCreate (whether via
>>>>>>> wmfdata.spark.get_session or directly), it returned a SparkSession
>>>>>>> object properly, but calling that object's sql function still gave
>>>>>>> the "stopped SparkContext" error.
>>>>>>>
>>>>>>> Any idea what's going on? I assume restarting the notebook kernel
>>>>>>> would take care of the problem, but it seems like there has to be
>>>>>>> a better way to recover.
>>>>>>>
>>>>>>> Thank you!
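>>>>>>>
>>>>>>> In case it helps, the failure sequence looks roughly like this (a
>>>>>>> minimal sketch in plain PySpark; the wmfdata wrappers ultimately
>>>>>>> call these same functions, and the actual query is omitted):
>>>>>>>
>>>>>>>     from pyspark.sql import SparkSession
>>>>>>>
>>>>>>>     big_query = "..."  # the large query that triggered the OOM
>>>>>>>
>>>>>>>     spark = SparkSession.builder.getOrCreate()
>>>>>>>     spark.sql(big_query)   # java.lang.OutOfMemoryError: Java heap space
>>>>>>>
>>>>>>>     spark.sql("SELECT 1")  # Cannot call methods on a stopped SparkContext
>>>>>>>
>>>>>>>     # This returns a SparkSession object without complaint, but it
>>>>>>>     # seems to be the same stopped session rather than a new one:
>>>>>>>     spark2 = SparkSession.builder.getOrCreate()
>>>>>>>     spark2.sql("SELECT 1") # still the stopped SparkContext error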
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics