Thanks for the reply Jagadish, it's really a good point. I checked the Kafka configuration: log compaction is enabled, but the other compaction-related settings are all at their defaults on the server side.
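For reference, these are the broker-side settings that I believe control the cleaner (names are standard Kafka broker configs; the values shown are the stock defaults as far as I can tell, not values I've verified on our brokers):

  # server.properties (illustrative defaults, not our actual settings)
  log.cleaner.enable=true                   # compaction only runs if the cleaner is enabled (default differs across versions)
  log.cleaner.threads=1                     # number of background cleaner threads
  log.cleaner.min.cleanable.ratio=0.5       # fraction of "dirty" data a log needs before it gets cleaned
  log.cleaner.delete.retention.ms=86400000  # how long tombstones are kept after compaction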
The checkpoint topic has a topic-level config which sets the cleanup policy to compact (done by Samza while creating the checkpoint topic, I believe):

[services@crawler12 bin]$ kafka-topics.sh --zookeeper localhost:2182 --topic __samza_checkpoint_ver_1_for_xxx-remote_1 --describe
Topic:__samza_checkpoint_ver_1_for_xxx-remote_1 PartitionCount:1 ReplicationFactor:2 *Configs:segment.bytes=26214400,cleanup.policy=compact*
        Topic: __samza_checkpoint_ver_1_for_xxx-remote_1 Partition: 0 Leader: 2 Replicas: 2,0 Isr: 2,0

The other thing is that log retention is set to 24 hours or 30 GB, but it doesn't seem to apply to the checkpoint topic: all the *.log files are still there, unlike the data topics, which only keep recent ones. (My understanding is that with cleanup.policy=compact the time/size retention doesn't kick in; only compaction reclaims space.) The checkpoint log is 3.6 GB altogether. I think reducing it to the MB level would help a lot; we currently produce about 100 MB of checkpoint data per day.

I am going to dig further on this (I've never configured compaction before), and your suggestions would be much appreciated. I've also put a rough sketch of what I'm planning to try at the very end of this mail.

On 8 May 2016 at 02:25, Jagadish Venkatraman <jagadish1...@gmail.com> wrote:

> Hi,
>
> Is your checkpoint topic log compacted? That may help in reducing the size
> of the log.
>
> On Sat, May 7, 2016 at 2:35 AM, Liu Bo <diabl...@gmail.com> wrote:
>
> > Hi group
> >
> > I've got this problem while doing maintenance of our Samza job; the job
> > has been running for a while and has processed nearly 350 million messages.
> >
> > We have 40 partitions with a 10s checkpoint commit interval, which leads
> > to a checkpoint topic with 18308702 messages. A 10s interval is probably
> > too short, so I've just increased it.
> >
> > Job startup hangs for 20 minutes getting the checkpointed offsets. I think
> > each container is trying to read all of the checkpoint messages, and Kafka
> > took up all 100 MB of server bandwidth at that time. Here's the log from
> > one of the machines:
> >
> > 2016-05-07 16:21:29 KafkaCheckpointManager [INFO] Reading checkpoint for taskName Partition 18
> >
> > 2016-05-07 16:21:29 KafkaCheckpointManager [INFO] No TaskName to checkpoint mapping provided. Reading for first time.
> >
> > 2016-05-07 16:21:29 KafkaCheckpointManager [INFO] Connecting to leader 10.111.0.132:9093 for topic __samza_checkpoint_ver_1_for_xxx-remote_1 and to *fetch all checkpoint messages*.
> >
> > 2016-05-07 16:21:29 KafkaCheckpointManager [INFO] Got offset 0 for topic __samza_checkpoint_ver_1_for_xxx-remote_1 and partition 0. Attempting to fetch messages for checkpoint log.
> >
> > 2016-05-07 *16:21:29* KafkaCheckpointManager [INFO] Get latest offset 18308702 for topic __samza_checkpoint_ver_1_for_xxx-remote_1 and partition 0.
> >
> > 2016-05-07 *16:43:42* KafkaCheckpointManager [INFO] Got checkpoint state for taskName Partition 18: Checkpoint [offsets={SystemStreamPartition [message, xxx_nlp, 18]=9350574}]
> >
> > 2016-05-07 16:43:42 OffsetManager [INFO] Checkpointed offset is currently 9350574 for SystemStreamPartition [message, xxx_nlp, 18]
> >
> > The log says there is no TaskName-to-checkpoint mapping; could that be the
> > cause? Here's my checkpoint config:
> >
> > job.name=xxx_remote
> > systems.message.samza.factory=org.apache.samza.system.kafka.KafkaSystemFactory
> > task.checkpoint.factory=org.apache.samza.checkpoint.kafka.KafkaCheckpointManagerFactory
> > task.checkpoint.system=message
> > task.commit.ms=10000
> > task.checkpoint.replication.factor=2
> >
> > My restart procedure is:
> >
> > 1. update the samza package on hdfs
> > 2. yarn kill the job
> > 3. start the job using the Samza run-job script
> >
> > I have done this several times, and only this time caused the problem, as
> > far as I can remember. Is there anything wrong with the steps or the
> > config? I suspect I may have started the job before the previous one could
> > shut down safely.
> >
> > Your help will be much appreciated.
> >
> > BTW: I think the checkpoint manager could start from the end of the
> > checkpoint topic and "look backward" for the offset of the current
> > partition instead of reading all of the messages.
> >
> > --
> > All the best
> >
> > Liu Bo
> >
>
>
> --
> Jagadish V,
> Graduate Student,
> Department of Computer Science,
> Stanford University
>

--
All the best

Liu Bo
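PS: here is roughly what I'm planning to try first: lowering the dirty ratio so the cleaner kicks in sooner, and forcing segments to roll so recent checkpoint data becomes eligible for compaction. This is just a sketch, untested on our cluster, and the values are guesses:

  # topic-level overrides for the checkpoint topic (illustrative values, not verified)
  kafka-configs.sh --zookeeper localhost:2182 --alter \
    --entity-type topics \
    --entity-name __samza_checkpoint_ver_1_for_xxx-remote_1 \
    --add-config min.cleanable.dirty.ratio=0.1,segment.ms=3600000

If kafka-configs.sh isn't available on our broker version, I believe the older form, kafka-topics.sh --zookeeper localhost:2182 --alter --topic <topic> --config <key>=<value>, does the same thing.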