I think I see what's happening.

When there are 8 tasks and I set yarn.container.count=8, then each
container is responsible for a single task.  However, the
systemStreamLagCounts map (
https://github.com/apache/samza/blob/0.9.0/samza-core/src/main/scala/org/apache/samza/system/chooser/BootstrappingChooser.scala#L77)
and laggingSystemStreamPartitions (
https://github.com/apache/samza/blob/0.9.0/samza-core/src/main/scala/org/apache/samza/system/chooser/BootstrappingChooser.scala#L83)
are configured to track all partitions for the bootstrap topic rather than
just the one partition assigned to this task.

Later in the log, we see that the task/container completed bootstrap for
its own partition.

2015-06-21 12:28:55 org.apache.samza.system.chooser.BootstrappingChooser
[DEBUG] Bootstrap stream partition is fully caught up:
SystemStreamPartition [kafka, deploy.svc.tlrnsZOYQA6wrwAA4FLqZA, 0]

but the BootstrappingChooser still thinks that the remaining partitions
(assigned to other tasks in other containers) need to be completed.  JMX at
this point shows 7 of the original 8 partitions still lagging.
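
To make the suspected behavior concrete, here is a toy model of it.  This
is only a sketch of my reading of the problem, not the Samza code: the
class ToyBootstrapTracker and all of its names are made up for
illustration, and the real chooser tracks more state than a single set.

```python
# Toy model only: names are hypothetical and do not come from Samza.
class ToyBootstrapTracker:
    """Tracks which bootstrap partitions are still considered lagging."""

    def __init__(self, tracked_partitions):
        # Every tracked partition starts out as lagging.
        self.lagging = set(tracked_partitions)

    def mark_caught_up(self, partition):
        # Called when a partition reaches the end of the bootstrap stream.
        self.lagging.discard(partition)

    def bootstrap_done(self):
        # Bootstrap is finished only when nothing is left in the set.
        return not self.lagging


ALL_PARTITIONS = range(8)  # the bootstrap topic has 8 partitions
ASSIGNED = [0]             # this container owns only partition 0

# Suspected behavior: the tracker is seeded with every partition.
tracks_all = ToyBootstrapTracker(ALL_PARTITIONS)
# Expected behavior: the tracker is seeded with the assigned partition only.
tracks_assigned = ToyBootstrapTracker(ASSIGNED)

# The container can only ever catch up on the partitions it consumes.
for tracker in (tracks_all, tracks_assigned):
    for partition in ASSIGNED:
        tracker.mark_caught_up(partition)

print(len(tracks_all.lagging), tracks_all.bootstrap_done())
print(len(tracks_assigned.lagging), tracks_assigned.bootstrap_done())
```

In the first tracker, 7 partitions stay in the lagging set because this
container never consumes them, so bootstrap never finishes from its point
of view.  That would match the JMX value above.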

I'm wondering why no one has run into this.  Doesn't LinkedIn use
partitioned bootstrapped topics?

Thanks,

Roger

On Sun, Jun 21, 2015 at 12:22 PM, Roger Hoover <roger.hoo...@gmail.com>
wrote:

> Hi Yan,
>
> I've uploaded a file with TRACE level logging here:
> http://filebin.ca/261yhsTZcZQZ/samza-container-0.log.gz
>
> I really appreciate your help as this is a critical issue for me.
>
> Thanks,
>
> Roger
>
> On Fri, Jun 19, 2015 at 12:05 PM, Yan Fang <yanfang...@gmail.com> wrote:
>
>> Hi Roger,
>>
>> " but it only spawns one container and still hangs after bootstrap"
>>     -- this is probably because your local machine does not have enough
>> resources for the second container. I checked your log file, and each
>> container is about 4GB.
>>
>> "When I run it on our YARN cluster with a single container, it works
>> correctly.  When I tried it with 5 containers, it gets hung after
>> consuming the bootstrap topic."
>>    -- Have you figured it out? I have looked at your log and also the
>> code. My suspicion is that a null envelope is somehow blocking the
>> process. If you can paste the trace-level log, it will be more helpful
>> because many of the log statements in the chooser are at trace level.
>>
>> Thanks,
>>
>> Fang, Yan
>> yanfang...@gmail.com
>>
>> On Thu, Jun 18, 2015 at 5:20 PM, Roger Hoover <roger.hoo...@gmail.com>
>> wrote:
>>
>> > I need some help.  I have a job which bootstraps one stream and then is
>> > supposed to read from two.  When I run it on our YARN cluster with a
>> > single container, it works correctly.  When I tried it with 5
>> > containers, it gets hung after consuming the bootstrap topic.  I ran it
>> > with the grid script on my laptop (Mac OS X) with
>> > yarn.container.count=2 but it only spawns one container and still hangs
>> > after bootstrap.
>> >
>> > Debug logs are here: http://pastebin.com/af3KPvju
>> >
>> > I looked at JMX metrics and see:
>> > - Task Metrics - no value for kafka offset of non-bootstrapped stream
>> > - SystemConsumerMetrics
>> >   - choose null keeps incrementing
>> >   - ssps-needed-by-chooser 1
>> >   - unprocessed-messages 62k
>> > - Bootstrapping Chooser
>> >   - lagging partitions 4
>> >   - lagging-batch-streams - 4
>> >   - batch-resets - 0
>> >
>> > Has anyone seen this or can offer ideas of how to better debug it?
>> >
>> > I'm using Samza 0.9.0 and YARN 2.4.0.
>> >
>> > Thanks!
>> >
>> > Roger
>> >
>>
>
>
