I think I see what's happening. When there are 8 tasks and I set yarn.container.count=8, then each container is responsible for a single task. However, the systemStreamLagCounts map ( https://github.com/apache/samza/blob/0.9.0/samza-core/src/main/scala/org/apache/samza/system/chooser/BootstrappingChooser.scala#L77) and laggingSystemStreamPartitions ( https://github.com/apache/samza/blob/0.9.0/samza-core/src/main/scala/org/apache/samza/system/chooser/BootstrappingChooser.scala#L83) are configured to track all partitions for the bootstrap topic rather than just the one partition assigned to this task.
Later in the log, we see that the task/container completed bootstrap for it's own partition. 2015-06-21 12:28:55 org.apache.samza.system.chooser.BootstrappingChooser [DEBUG] Bootstrap stream partition is fully caught up: SystemStreamPartition [kafka, deploy.svc.tlrnsZOYQA6wrwAA4FLqZA, 0] but the Bootstrapping Chooser still thinks that the remaining partitions (assigned to other tasks in other containers) need to be completed. JMX at this point shows 7 lagging partitions of the 8 original partition count. I'm wondering why no one has run into this. Doesn't LinkedIn use partitioned bootstrapped topics? Thanks, Roger On Sun, Jun 21, 2015 at 12:22 PM, Roger Hoover <roger.hoo...@gmail.com> wrote: > Hi Yan, > > I've uploaded a file with TRACE level logging here: > http://filebin.ca/261yhsTZcZQZ/samza-container-0.log.gz > > I really appreciate your help as this is a critical issue for me. > > Thanks, > > Roger > > On Fri, Jun 19, 2015 at 12:05 PM, Yan Fang <yanfang...@gmail.com> wrote: > >> Hi Roger, >> >> " but it only spawns one container and still hangs after bootstrap" >> -- this probably is due to your local machine does not have enough >> resource for the second container. Because I checked your log file, each >> container is about 4GB. >> >> "When I run it on our YARN cluster with a single container, it works >> correctly. When I tried it with 5 containers, it gets hung after >> consuming >> the bootstrap topic." >> -- Have you figure it out? I have a looked at your log and also the >> code. My suspect is that, there is a null enveloper somehow blocking the >> process. If you can paste the trace level log, it will be more helpful >> because many logs in chooser are trace level. >> >> Thanks, >> >> Fang, Yan >> yanfang...@gmail.com >> >> On Thu, Jun 18, 2015 at 5:20 PM, Roger Hoover <roger.hoo...@gmail.com> >> wrote: >> >> > I need some help. I have a job which bootstraps one stream and then is >> > supposed to read from two. When I run it on our YARN cluster with a >> single >> > container, it works correctly. When I tried it with 5 containers, it >> gets >> > hung after consuming the bootstrap topic. I ran it with the grid >> script on >> > my laptop (Mac OS X) with yarn.container.count=2 but it only spawns one >> > container and still hangs after bootstrap. >> > >> > Debug logs are here: http://pastebin.com/af3KPvju >> > >> > I looked at JMX metrics and see: >> > - Task Metrics - no value for kafka offset of non-bootstrapped stream >> > - SystemConsumerMetrics >> > - choose null keeps incrementing >> > - ssps-needed-by-chooser 1 >> > - unprocessed-messages 62k >> > - Bootstrapping Chooser >> > - lagging partitions 4 >> > - laggin-batch-streams - 4 >> > - batch-resets - 0 >> > >> > Has anyone seen this or can offer ideas of how to better debug it? >> > >> > I'm using Samza 0.9.0 and YARN 2.4.0. >> > >> > Thanks! >> > >> > Roger >> > >> > >