At least some of the jobs are typically doing work that would make them
difficult to share, e.g. accessing HDFS.  I'll see if I can put together a
smaller reproducible case.
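
In the meantime, the "completely trivial" 4th job mentioned below amounts to
roughly the following sketch (the master URL and executor memory value here
are placeholders, not our actual settings):

import org.apache.spark.{SparkConf, SparkContext}

// Placeholder Mesos master URL and executor memory; real values differ.
val conf = new SparkConf()
  .setAppName("trivial-sum")
  .setMaster("mesos://zk://zk-host:2181/mesos")
  .set("spark.executor.memory", "512m")
val sc = new SparkContext(conf)
// "parallelize 1 to 10000 and sum"
println(sc.parallelize(1 to 10000).reduce(_ + _))
sc.stop()

This is submitted against the same Mesos master as the other jobs, in the
default fine-grained mode.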


On Wed, Aug 20, 2014 at 3:23 PM, Timothy Chen <t...@mesosphere.io> wrote:

> Can you share your Spark / Mesos configurations and the Spark job? I'd
> like to repro it.
>
> Tim
>
> > On Aug 20, 2014, at 12:39 PM, Cody Koeninger <c...@koeninger.org> wrote:
> >
> > I'm seeing situations where starting e.g. a 4th Spark job on Mesos
> > results in none of the jobs making progress.  This happens even with
> > --executor-memory set to values that should not come close to exceeding
> > the availability per node, and even if the 4th job is doing something
> > completely trivial (e.g. parallelize 1 to 10000 and sum).  Killing one
> > of the jobs typically allows the others to proceed.
> >
> > While the jobs are hung, I see the following in the Mesos master logs:
> >
> > I0820 19:28:02.651296 24666 master.cpp:2282] Sending 7 offers to framework 20140820-170154-1315739402-5050-24660-0020
> > I0820 19:28:02.654502 24668 master.cpp:1578] Processing reply for offers: [ 20140820-170154-1315739402-5050-24660-96624 ] on slave 20140724-150750-1315739402-5050-25405-6 (dn-04) for framework 20140820-170154-1315739402-5050-24660-0020
> > I0820 19:28:02.654722 24668 hierarchical_allocator_process.hpp:590] Framework 20140820-170154-1315739402-5050-24660-0020 filtered slave 20140724-150750-1315739402-5050-25405-6 for 1secs
> >
> > Am I correctly interpreting that to mean that Spark is being offered
> > resources but is rejecting them?  Is there a way (short of patching
> > Spark to add more logging) to figure out why the resources are being
> > rejected?
> >
> > This is on the default fine-grained mode.
> >
>
