Hey Davies,

Sure thing, it's filed here now:
https://issues.apache.org/jira/browse/SPARK-5395

As far as a repro goes, what is a normal number of workers I should
expect? Even shortly after kicking off the job, I see worker counts in the
double digits per container. Here's an example using pstree on a worker
node (this node runs 16 containers, i.e. the java--bash branches below).
The initial Python process in each container is forked between 7 and 33
times.

     ├─java─┬─bash───java─┬─python───22*[python]
     │      │             └─89*[{java}]
     │      ├─bash───java─┬─python───13*[python]
     │      │             └─80*[{java}]
     │      ├─bash───java─┬─python───9*[python]
     │      │             └─78*[{java}]
     │      ├─bash───java─┬─python───24*[python]
     │      │             └─91*[{java}]
     │      ├─3*[bash───java─┬─python───19*[python]]
     │      │                └─86*[{java}]]
     │      ├─bash───java─┬─python───25*[python]
     │      │             └─90*[{java}]
     │      ├─bash───java─┬─python───33*[python]
     │      │             └─98*[{java}]
     │      ├─bash───java─┬─python───7*[python]
     │      │             └─75*[{java}]
     │      ├─bash───java─┬─python───15*[python]
     │      │             └─80*[{java}]
     │      ├─bash───java─┬─python───18*[python]
     │      │             └─84*[{java}]
     │      ├─bash───java─┬─python───10*[python]
     │      │             └─85*[{java}]
     │      ├─bash───java─┬─python───26*[python]
     │      │             └─90*[{java}]
     │      ├─bash───java─┬─python───17*[python]
     │      │             └─85*[{java}]
     │      ├─bash───java─┬─python───24*[python]
     │      │             └─92*[{java}]
     │      └─265*[{java}]

Each container has 2 cores, in case that makes a difference.
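
For reference, here's roughly how I'm tallying the daemons per container
(just a sketch, not the exact commands I ran; it assumes a Linux
procps-style ps and simply groups pyspark.daemon processes by their parent
PID):

    # Rough sketch: count pyspark.daemon processes per parent PID
    # by parsing ps output; assumes Linux procps-style "ps" options.
    import collections
    import subprocess

    # "ppid=,args=" prints the parent PID and full command line, no header
    out = subprocess.check_output(["ps", "-e", "-o", "ppid=,args="]).decode()

    counts = collections.Counter()
    for line in out.splitlines():
        ppid, _, cmd = line.strip().partition(" ")
        if "pyspark.daemon" in cmd:
            counts[ppid] += 1

    for ppid, n in counts.most_common():
        print("parent %s: %d pyspark.daemon processes" % (ppid, n))

Grouping by parent PID roughly mirrors the python───N*[python] branches in
the pstree above.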

Thank you!
-Sven

On Fri, Jan 23, 2015 at 11:52 PM, Davies Liu <dav...@databricks.com> wrote:

> It should be a bug; the Python workers did not exit normally. Could you
> file a JIRA for this?
>
> Also, could you show how to reproduce this behavior?
>
> On Fri, Jan 23, 2015 at 11:45 PM, Sven Krasser <kras...@gmail.com> wrote:
> > Hey Adam,
> >
> > I'm not sure I understand just yet what you have in mind. My takeaway
> > from the logs is that the container actually was above its allotment of
> > about 14G. Since 6G of that is for overhead, I assumed there would be
> > plenty of room for the Python workers, but there seem to be more of
> > them than I'd expect.
> >
> > Does anyone know if that is actually the intended behavior, i.e. in this
> > case over 90 Python processes on a 2 core executor?
> >
> > Best,
> > -Sven
> >
> >
> > On Fri, Jan 23, 2015 at 10:04 PM, Adam Diaz <adam.h.d...@gmail.com>
> wrote:
> >>
> >> YARN only has the ability to kill, not to checkpoint or suspend via a
> >> signal. If you use too much memory, it will simply kill tasks based on
> >> the YARN config.
> >> https://issues.apache.org/jira/browse/YARN-2172
> >>
> >>
> >> On Friday, January 23, 2015, Sandy Ryza <sandy.r...@cloudera.com>
> wrote:
> >>>
> >>> Hi Sven,
> >>>
> >>> What version of Spark are you running?  Recent versions have a change
> >>> that allows PySpark to share a pool of processes instead of starting a
> new
> >>> one for each task.
> >>>
> >>> -Sandy
> >>>
> >>> On Fri, Jan 23, 2015 at 9:36 AM, Sven Krasser <kras...@gmail.com>
> wrote:
> >>>>
> >>>> Hey all,
> >>>>
> >>>> I am running into a problem where YARN kills containers for being over
> >>>> their memory allocation (which is about 8G for executors plus 6G for
> >>>> overhead), and I noticed that in those containers there are tons of
> >>>> pyspark.daemon processes hogging memory. Here's a snippet from a
> container
> >>>> with 97 pyspark.daemon processes. The total sum of RSS usage across
> all of
> >>>> these is 1,764,956 pages (i.e. 6.7GB on the system).
> >>>>
> >>>> Any ideas what's happening here and how I can get the number of
> >>>> pyspark.daemon processes back to a more reasonable count?
> >>>>
> >>>> 2015-01-23 15:36:53,654 INFO  [Reporter] yarn.YarnAllocationHandler
> >>>> (Logging.scala:logInfo(59)) - Container marked as failed:
> >>>> container_1421692415636_0052_01_000030. Exit status: 143. Diagnostics:
> >>>> Container
> [pid=35211,containerID=container_1421692415636_0052_01_000030] is
> >>>> running beyond physical memory limits. Current usage: 14.9 GB of 14.5
> GB
> >>>> physical memory used; 41.3 GB of 72.5 GB virtual memory used. Killing
> >>>> container.
> >>>> Dump of the process-tree for container_1421692415636_0052_01_000030 :
> >>>>    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS)
> >>>> SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES)
> FULL_CMD_LINE
> >>>>    |- 54101 36625 36625 35211 (python) 78 1 332730368 16834 python -m
> >>>> pyspark.daemon
> >>>>    |- 52140 36625 36625 35211 (python) 58 1 332730368 16837 python -m
> >>>> pyspark.daemon
> >>>>    |- 36625 35228 36625 35211 (python) 65 604 331685888 17694 python
> -m
> >>>> pyspark.daemon
> >>>>
> >>>>    [...]
> >>>>
> >>>>
> >>>> Full output here:
> https://gist.github.com/skrasser/e3e2ee8dede5ef6b082c
> >>>>
> >>>> Thank you!
> >>>> -Sven
> >>>>
> >>>> --
> >>>> krasser
> >>>
> >>>
> >
> >
> >
> > --
> > http://sites.google.com/site/krasser/?utm_source=sig
>



-- 
http://sites.google.com/site/krasser/?utm_source=sig
