Can you share your spark / mesos configurations and the spark job? I'd like to repro it.
Tim

> On Aug 20, 2014, at 12:39 PM, Cody Koeninger <c...@koeninger.org> wrote:
>
> I'm seeing situations where starting e.g. a 4th spark job on Mesos results in
> none of the jobs making progress. This happens even with --executor-memory
> set to values that should not come close to exceeding the availability per
> node, and even if the 4th job is doing something completely trivial (e.g.
> parallelize 1 to 10000 and sum). Killing one of the jobs typically allows
> the others to start proceeding.
>
> While jobs are hung, I see the following in mesos master logs:
>
> I0820 19:28:02.651296 24666 master.cpp:2282] Sending 7 offers to framework
> 20140820-170154-1315739402-5050-24660-0020
> I0820 19:28:02.654502 24668 master.cpp:1578] Processing reply for offers: [
> 20140820-170154-1315739402-5050-24660-96624 ] on slave
> 20140724-150750-1315739402-5050-25405-6 (dn-04) for framework
> 20140820-170154-1315739402-5050-24660-0020
> I0820 19:28:02.654722 24668 hierarchical_allocator_process.hpp:590] Framework
> 20140820-170154-1315739402-5050-24660-0020 filtered slave
> 20140724-150750-1315739402-5050-25405-6 for 1secs
>
> Am I correctly interpreting that to mean that spark is being offered
> resources, but is rejecting them? Is there a way (short of patching spark to
> add more logging) to figure out why resources are being rejected?
>
> This is on the default fine-grained mode.
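For anyone following along: when several Spark jobs share a Mesos cluster, one common first step is to cap how many cores each job may claim, so that concurrent frameworks leave offers for one another instead of one job soaking up everything. A minimal sketch of such a submission (the master URL, memory/core values, class name, and jar are illustrative placeholders, not configuration from this thread):

```shell
# Cap each job's total core usage so multiple Spark frameworks can
# coexist on one Mesos cluster. All values below are placeholders.
spark-submit \
  --master mesos://zk://zk-host:2181/mesos \
  --executor-memory 2g \
  --conf spark.cores.max=8 \
  --class org.example.TrivialSum \
  my-job.jar
```

Whether this helps here depends on why the offers are being filtered, but it narrows down whether the hang is a core-allocation problem rather than a memory one.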