There is no workaround without a code change in Tez.


The simplest code change would be to make this behavior configurable and
keep the current behavior as the default.
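
A minimal sketch of what such a switch might look like, written as standalone
Java rather than as a patch against TaskScheduler. The class name
ReleasePolicy and the flag reuseAfterFailure are hypothetical; nothing in this
thread says Tez has such an option. With the flag off, the decision reduces to
the existing check (!taskSucceeded || !shouldReuseContainers).

```java
/** Hypothetical release decision for illustration only; not part of Tez. */
final class ReleasePolicy {
    private final boolean shouldReuseContainers;
    // Hypothetical new option. false reproduces the current behavior.
    private final boolean reuseAfterFailure;

    ReleasePolicy(boolean shouldReuseContainers, boolean reuseAfterFailure) {
        this.shouldReuseContainers = shouldReuseContainers;
        this.reuseAfterFailure = reuseAfterFailure;
    }

    /** Returns true when the container should be released after the task ends. */
    boolean shouldRelease(boolean taskSucceeded) {
        if (!shouldReuseContainers) {
            return true; // reuse is off entirely, so always release
        }
        // Reuse is on: release only after a failure, and only if the
        // hypothetical flag does not allow keeping failed-task containers.
        return !taskSucceeded && !reuseAfterFailure;
    }
}
```

This covers only the scheduler-side decision. As noted further down the
thread, a failed task currently makes the JVM exit, so that would also have to
change before a kept container is of any use.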



Btw, you can also try the session min held containers configuration that
was recently added. This ensures that your session will retain some minimum
resources. You can use the session min/max timeouts to decay excess
containers.
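
A sketch of how those settings might be assembled, using a plain Java map so
it runs without Tez on the classpath. The property names are assumptions taken
from TezConfiguration in later Tez releases and may differ in your version, so
check them before use. The class name SessionContainerSettings and the values
in the usage note are illustrative only.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative helper; key names are assumed, verify against TezConfiguration. */
final class SessionContainerSettings {
    static Map<String, String> build(int minHeld, long minTimeoutMs, long maxTimeoutMs) {
        if (minHeld < 0) {
            throw new IllegalArgumentException("minHeld must be >= 0");
        }
        if (minTimeoutMs > maxTimeoutMs) {
            throw new IllegalArgumentException("min timeout must not exceed max timeout");
        }
        Map<String, String> conf = new LinkedHashMap<>();
        conf.put("tez.am.container.reuse.enabled", "true");
        // Minimum number of containers the session keeps while idle (assumed key).
        conf.put("tez.am.session.min.held-containers", Integer.toString(minHeld));
        // Idle containers above the minimum are released between these bounds (assumed keys).
        conf.put("tez.am.container.idle.release-timeout-min.millis", Long.toString(minTimeoutMs));
        conf.put("tez.am.container.idle.release-timeout-max.millis", Long.toString(maxTimeoutMs));
        return conf;
    }
}
```

For example, build(4, 10000L, 60000L) yields four entries that could then be
copied into the session's Configuration with conf.set(key, value).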



Bikas



*From:* Thaddeus Diamond [mailto:[email protected]]
*Sent:* Wednesday, July 30, 2014 8:51 PM
*To:* [email protected]
*Subject:* Re: Reusing Containers Of Failed Tasks



I see.  Is there a manual workaround you suggest for this?



The motivation is this: I have an application with low latency and max
concurrency SLAs.  The way we are trying to solve this with Tez is to keep
an application-level pool of Tez sessions and configure each to have
long-lived containers.  When users submit DAGs the application grabs an
idle Tez session from the pool and submits to that one. After the DAG
completes (successful or not) it is returned to the pool in an idle state.



If a session gets returned to the pool but no containers are spun up in it
because the DAG failed, I will fail to meet my SLAs on the next DAG
submission.
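
The pooling scheme described above could be sketched roughly as follows. This
is an assumption-laden illustration, not the actual application code: the
class SessionPool is hypothetical, and the type parameter S stands in for
whatever Tez session handle the application holds, so no Tez API is used.

```java
import java.util.Collection;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.function.Function;

/** Hypothetical pool of idle sessions; S is a placeholder for a Tez session handle. */
final class SessionPool<S> {
    private final BlockingQueue<S> idle;

    SessionPool(Collection<S> sessions) {
        this.idle = new LinkedBlockingQueue<>(sessions);
    }

    /** Takes an idle session, runs the work on it, and always returns it to the pool. */
    <R> R withSession(Function<S, R> work) {
        S session;
        try {
            session = idle.take(); // blocks until some session is idle
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for an idle session", e);
        }
        try {
            return work.apply(session);
        } finally {
            idle.add(session); // returned whether the DAG succeeded or not
        }
    }

    int idleCount() {
        return idle.size();
    }
}
```

The finally block mirrors the "successful or not" rule. It does nothing about
the state of the containers inside the returned session, which is the problem
raised in the next paragraph.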



On Wed, Jul 30, 2014 at 8:05 PM, Bikas Saha <[email protected]> wrote:

Currently, failed tasks make the JVM exit. There is no workaround for
that. Before we can change that, we would need to be able to verify that task
execution is isolated, such that a task failure does not end up “corrupting”
the host.



Bikas



*From:* Thaddeus Diamond [mailto:[email protected]]
*Sent:* Wednesday, July 30, 2014 3:15 PM
*To:* [email protected]
*Subject:* Reusing Containers Of Failed Tasks



Hi,



I turned on container reuse and upped the time that containers linger after
task vertex completion (tez.am.container.session.delay-allocation-millis),
but I'm still having an issue.  Sometimes, the Processor I created will
fail due to application logic in one DAG but not the next. The trivial
example is:



class MyProcessor implements LogicalIOProcessor {
  // Other non-application logic code
  public void run(...) {
    if (new Random().nextBoolean()) {
      throw new FooBarBazException();
    }
  }
}



In this case I don't want the task JVM to be deallocated, because it was
application logic that caused the failure, and the next time I start a DAG I
will incur the long task JVM startup delay.



I see the following code in the source (TaskScheduler#deallocateTask(...))
that I think is the cause of this:



      if (!taskSucceeded || !shouldReuseContainers) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("Releasing container, containerId=" + container.getId()
              + ", taskSucceeded=" + taskSucceeded
              + ", reuseContainersFlag=" + shouldReuseContainers);
        }
        releaseContainer(container.getId());
      }



Is this something that can be fixed in master? Or is there a
workaround/conf I can set to get this working?



Thanks,

Thad


CONFIDENTIALITY NOTICE
NOTICE: This message is intended for the use of the individual or entity to
which it is addressed and may contain information that is confidential,
privileged and exempt from disclosure under applicable law. If the reader
of this message is not the intended recipient, you are hereby notified that
any printing, copying, dissemination, distribution, disclosure or
forwarding of this communication is strictly prohibited. If you have
received this communication in error, please contact the sender immediately
and delete it from your system. Thank You.
