Great and detailed report! Really appreciate it! -Yi
On Tue, Jun 18, 2019 at 2:37 PM Malcolm McFarland <mmcfarl...@cavulus.com>
wrote:

> Just want to follow up on this, for anybody that might be trying to do
> something similar.
>
> There are two things that were getting in the way of us using YARN+Samza on
> ECS: 1) YARN needs to be able to resolve its hostname to something that's
> publicly available; and 2) Samza needs to be able to open connections on
> arbitrary ports in the 30000+ range.
>
> Docker confounds each of these in a different way. For the first, Docker's
> hostname inside of the container is an arbitrary hash, and this is what
> java.net.InetAddress will resolve to. I took Rayman's suggestion and used
> dnsmasq to create a local CNAME mapping inside the container, mapping the
> local "hostname" to one that is publicly available. This should work well
> for any Docker-hosted JVM app relying on java.net.InetAddress.
>
> Docker also only allows 100 ports to be publicly exposed, and there is no
> configuration option in Samza to specify what the range of ports will be.
> The way we worked around this on ECS was to create an elastic network
> interface (ENI) for each of the node manager containers. Although I can't
> find any documentation on this, I suspect that Fargate does this by
> default, as the whole point of that service is to bypass the restrictions
> placed on containers running on EC2 instances. With the ENI, we no longer
> had to explicitly expose any ports; all ports will be available if the
> security group allows.
>
> As an aside, you might wonder: why not just run these on Fargate? Well,
> Fargate only allows 10GB of storage (this can be extended a small amount
> via an ephemeral mounted volume, but seemingly not enough to satisfy YARN's
> VM requirements).
>
> Hth, and thanks for everybody's patience,
>
> Malcolm McFarland
> Cavulus
>
>
> This correspondence is from HealthPlanCRM, LLC, d/b/a Cavulus. Any
> unauthorized or improper disclosure, copying, distribution, or use of the
> contents of this message is prohibited. The information contained in this
> message is intended only for the personal and confidential use of the
> recipient(s) named above. If you have received this message in error,
> please notify the sender immediately and delete the original message.
>
>
> On Fri, May 31, 2019 at 3:08 PM rayman preet <rayman7...@gmail.com> wrote:
>
> > Apart from /etc/hosts and /bin/hostname, the only other relevant place
> > might be /etc/resolv.conf: modify its values to point to, e.g., a
> > dnsmasq instance.
> >
> > On Fri, May 31, 2019 at 2:43 PM Malcolm McFarland
> > <mmcfarl...@cavulus.com> wrote:
> >
> > > Hey Rayman,
> > >
> > > The ops group and I went through the configuration today and observed
> > > the YARN containers as they were coming up. We seem to have found the
> > > root of the problem, and I'm putting this out there for anybody else
> > > that's trying to do something similar on AWS ECS:
> > >
> > > The ECS container instances set their hostname to the container ID on
> > > startup (e.g. 717b6f75aaf8), and this looks like it's interfering with
> > > the YARN container startup process. This *seems* to be corroborated in
> > > that containers that start on the same host as their AM look to be
> > > starting fine (i.e. they can locally resolve their IP address
> > > correctly), but containers starting on other hosts don't seem to be.
> > > We were *not* having this problem on Fargate, and my only guess is
> > > that, given Fargate's intended use case as a
> > > replicated-services-in-the-cloud environment, AWS sets the hostname
> > > for Fargate-bound Docker containers on launch (e.g.
> > > ip-10-#-#-#.us-west-#.internal.local or whatever).
> > > (As a side note, we probably would have stuck with Fargate and not run
> > > into this problem, but Fargate instances are only allowed 10GB of disk
> > > space, and this wasn't enough for YARN's VM requirements.)
> > >
> > > I've been fishing around for a way to get Samza to resolve the hostname
> > > to something more publicly available. I've thus far tried a) changing
> > > the /etc/hosts file, and b) replacing the /bin/hostname binary in the
> > > container with a static script, but neither of these options seems to
> > > have an effect on Java's DNS resolution. Two further options I can
> > > think of are:
> > >
> > > - find some place in the Samza configuration where the hostname can be
> > >   set explicitly; or
> > > - change just the right piece of information in the system so that
> > >   java.net.InetAddress will resolve the localhost to something other
> > >   than what's returned from /bin/hostname (I'm guessing it uses
> > >   gethostname() on Ubuntu, could be wrong).
> > >
> > > Any ideas, anybody?
> > >
> > > Cheers,
> > > Malcolm McFarland
> > > Cavulus
> > >
> > >
> > > On Fri, May 31, 2019 at 9:27 AM rayman preet <rayman7...@gmail.com>
> > > wrote:
> > >
> > > > Yes, I think your hunch is right. Each container queries the AM over
> > > > HTTP to obtain the jobModel that it is supposed to run. The AM runs
> > > > an HTTP server, usually on a dynamically allocated free port on the
> > > > machine it's running on. So it's possible that a firewall rule
> > > > blocks the container when it tries to reach this port on the AM's
> > > > machine?
> > > >
> > > > --
> > > > thanks
> > > > rayman
> > > >
> > > > On Thu, May 30, 2019 at 5:30 PM Malcolm McFarland
> > > > <mmcfarl...@cavulus.com> wrote:
> > > >
> > > > > Thanks for the image, appreciate you taking the effort to do that!
> > > > > I'm still hitting this wall. The AM will launch the container, the
> > > > > container will go from "accepted" to "running", but there will be
> > > > > no output from the container (I'm piping all of the Samza,
> > > > > org.apache, org.kafka, and our own application's logging output to
> > > > > a Kafka topic). During these periods, the container will hang out
> > > > > at ~100MB/8GB memory usage and stall. There's no error output when
> > > > > this happens; it just kind of stops. My suspicion is that our Ops
> > > > > group has a firewall rule up that's interfering with this, or
> > > > > maybe just isn't white-listing a port correctly, and if I could
> > > > > identify where the application is stalling, it'd probably help to
> > > > > narrow down the possibilities.
> > > > >
> > > > > Cheers,
> > > > > Malcolm McFarland
> > > > > Cavulus
> > > > >
> > > > > On Thu, May 30, 2019 at 1:39 PM rayman preet <rayman7...@gmail.com>
> > > > > wrote:
> > > > >
> > > > > > I uploaded the image here:
> > > > > > https://www.dropbox.com/s/rv57v165ysp12c5/samza%20flow.png?dl=0
> > > > > >
> > > > > > Are you still running into this issue?
> > > > > > Is there anything in the container's log that shows any
> > > > > > exceptions/errors?
> > > > > >
> > > > > > On Wed, May 22, 2019 at 10:15 PM Malcolm McFarland
> > > > > > <mmcfarl...@cavulus.com> wrote:
> > > > > >
> > > > > > > Hey rayman,
> > > > > > >
> > > > > > > What it looks like is that the AM has started and the container
> > > > > > > has started, but these are the last messages I see in the
> > > > > > > Samza logs:
> > > > > > >
> > > > > > >   2019-05-23T05:10:45.048Z INFO Making a request for ANY_HOST
> > > > > > >   2019-05-23T05:10:45.057Z INFO Starting the container allocator thread
> > > > > > >   2019-05-23T05:10:47.098Z INFO Received new token for : <valid_host>:8032
> > > > > > >   2019-05-23T05:10:47.102Z INFO Container allocated from RM on <same_valid_host>
> > > > > > >   2019-05-23T05:10:47.105Z INFO Container allocated from RM on <same_valid_host>
> > > > > > >
> > > > > > > At this point, it seems to stall, and no more output is
> > > > > > > produced.
> > > > > > >
> > > > > > > Also, I couldn't see your diagram (it's possible my company's
> > > > > > > email filters attachments); can I see that on the web anywhere?
> > > > > > >
> > > > > > > Cheers,
> > > > > > > Malcolm
> > > > > > >
> > > > > > > On Wed, May 22, 2019 at 4:30 PM rayman preet
> > > > > > > <rayman7...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi Malcolm,
> > > > > > > >
> > > > > > > > This figure (attached) gives an overview of the flow. Is
> > > > > > > > this something you were looking for?
> > > > > > > >
> > > > > > > > Also, by "don't fully start up" do you mean that
> > > > > > > > applications are missing some containers (but the
> > > > > > > > ApplicationMaster is running)? Or is the application
> > > > > > > > missing entirely?
> > > > > > > >
> > > > > > > > --
> > > > > > > > thanks
> > > > > > > > rayman
> > > > > > > > [image: Samza Job Launch Sequence.png]
> > > > > > > >
> > > > > > > > On Tue, May 21, 2019 at 3:58 PM Malcolm McFarland
> > > > > > > > <mmcfarl...@cavulus.com> wrote:
> > > > > > > >
> > > > > > > > > Hey Folks,
> > > > > > > > >
> > > > > > > > > I'm still trying to pin down why these applications are
> > > > > > > > > sometimes not starting. Everything looks fine in the YARN
> > > > > > > > > web UI and in the immediately available logs, but the
> > > > > > > > > applications don't always fully start up. Does anybody
> > > > > > > > > have a rundown about how to trace the Samza startup
> > > > > > > > > process on a YARN cluster, from Accepted status, to
> > > > > > > > > localization, to the application master startup, to the
> > > > > > > > > actual application's startup?
> > > > > > > > >
> > > > > > > > > Cheers,
> > > > > > > > > Malcolm
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Malcolm McFarland
> > > > > > > > > Cavulus
> > > > > > > >
> > > > > > > > --
> > > > > > > > thanks
> > > > > > > > rayman
> > > > > > >
> > > > > > > --
> > > > > > > Malcolm McFarland
> > > > > > > Cavulus
> > > > > > > 1-800-760-6915
> > > > > > > mmcfarl...@cavulus.com
> > > > > >
> > > > > > --
> > > > > > thanks
> > > > > > rayman
> > > >
> > > > --
> > > > thanks
> > > > rayman
> >
> > --
> > thanks
> > rayman
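
For anybody reproducing the hostname problem described in this thread, here is a minimal sketch of a check that can be run inside the container. It is written in Python rather than Java and is not part of Samza or YARN. It only mimics the two steps that the Javadoc describes for java.net.InetAddress.getLocalHost(): read the host name from the system, then resolve that name to an address. The function name probe_local_hostname and its two injectable parameters are hypothetical, chosen for illustration. The JVM has its own lookup caching and settings, so treat the output as indicative only.

```python
import socket


def probe_local_hostname(gethostname=socket.gethostname,
                         getaddrinfo=socket.getaddrinfo):
    """Return the system hostname and the addresses it resolves to.

    Both lookups are injectable (an assumption made for this sketch) so
    the function can be exercised without touching the real resolver.
    """
    # Step 1: ask the OS for the hostname. In a Docker container this is
    # typically the short container ID unless it was overridden at launch.
    name = gethostname()
    try:
        # Step 2: resolve that name through the system resolver.
        infos = getaddrinfo(name, None)
    except socket.gaierror as exc:
        # The name does not resolve at all.
        return {"hostname": name, "addresses": [], "error": str(exc)}
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string.
    addresses = sorted({info[4][0] for info in infos})
    return {"hostname": name, "addresses": addresses, "error": None}


if __name__ == "__main__":
    print(probe_local_hostname())
```

If the reported hostname is a container ID, or the address list holds only an address that other nodes cannot reach, that matches the symptom described above, where containers on the same host as the AM started fine and containers on other hosts stalled.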