Thanks, Andrew, for the detailed response.

We have a replication factor of 3, so we’re safe there. What do you
recommend for min.insync.replicas, acks and log.flush.interval.*?

I was worried about region failures, or someone going in and deleting our
instances and their associated volumes (that kind of disaster).

- Mirror Maker sounds great, but does it preserve the topic configuration,
partitioning and offsets? We basically want our consumers to keep working
as they did before in case we lose a whole cluster.

- Secor, as you said, will allow me to back up the data, but not to reload
it into a cluster, so I don’t think it fits our exact purpose.

- EBS snapshots give some kind of point-in-time recovery (although some
data may be lost, as you said). Shutting down the brokers one at a time
before each EBS backup sounds like an option; I guess the backup would just
take a while to roll across the whole cluster? (Rough sketch of what I have
in mind below.)
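
Roughly what I have in mind per broker, one at a time (the service name and
volume id are placeholders for our setup, not tested):

    # stop the broker, snapshot its data volume, bring it back
    sudo service kafka stop                  # or however the broker is managed
    aws ec2 create-snapshot \
        --volume-id vol-0123456789abcdef0 \
        --description "kafka broker 1 daily backup"
    sudo service kafka start
    # wait for the broker to rejoin the ISR before moving to the next one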

I appreciate your feedback and look forward to hearing from you

Regards,
Stephane





On 22 December 2016 at 5:40:30 pm, Andrew Clarkson (andrew.clark...@rallyhealth.com) wrote:

Hi Stephane,

I say this not to be condescending in any way, but simple replication
*might* cover your needs. This will cover most node failures that cause an
unclean shutdown, such as disk or power failure. This assumes that one of the
replicas of your data survives (see the configs min.insync.replicas, acks,
and log.flush.interval.*). Making sure that you have the correct ack'ing
and replication strategy will likely cover a lot of the failure/recovery
use cases.
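
For illustration only, the knobs in play look something like this (the
values here are placeholders, not recommendations for your workload):

    # broker / topic level
    default.replication.factor=3
    min.insync.replicas=2              # a write needs at least 2 in-sync replicas
    log.flush.interval.messages=10000  # how often the broker fsyncs; often left to the OS
    log.flush.interval.ms=1000

    # producer side
    acks=all                           # wait for all in-sync replicas to acknowledge
    retries=3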

If you need better recovery/availability guarantees than simple
replication, the de facto mechanism is "mirroring
<https://kafka.apache.org/documentation.html#basic_ops_mirror_maker>" using
a tool called "mirror maker". This would cover cases where an entire
cluster crashed (like an AWS region being down) or other catastrophic
failures. This is the preferred way to do multi-data center (multi-region)
replication.
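
The invocation is basically a consumer on the source cluster chained to a
producer on the target cluster, something along these lines (the .properties
file names are placeholders):

    bin/kafka-mirror-maker.sh \
        --consumer.config source-cluster-consumer.properties \
        --producer.config target-cluster-producer.properties \
        --whitelist ".*"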

Back to EBS snapshots. From what I understand, snapshotting the file system
won't give you a full picture of what's going on, because brokers flush the
logs infrequently and, as you mentioned, a snapshot can leave the logs in a
"corrupted" state that forces a repair / index rebuild on startup.

If you need a persistent record in order to rerun expired data (see the
configs log.retention.*), you might want to look at a tool like Secor
<https://github.com/pinterest/secor>. Secor will write all messages to an
S3 bucket from which you could rerun the data if you need to. Sadly, it
doesn't come with a producer to rerun the data and you would have to write
your own.
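
If it helps, a re-run producer could be fairly small. The sketch below is
hypothetical: it assumes the archive has already been restored from S3 to
local files of tab-separated key/value lines, which is an assumption about
the export format rather than how Secor actually lays out its output; the
class name, paths and bootstrap address are all placeholders.

    // Hypothetical sketch only: replay restored archive files into a topic.
    // Assumes plain "key<TAB>value" text lines; Secor's real output format
    // would need its own reader.
    import org.apache.kafka.clients.producer.KafkaProducer;
    import org.apache.kafka.clients.producer.ProducerRecord;
    import org.apache.kafka.common.serialization.StringSerializer;

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.Properties;
    import java.util.stream.Stream;

    public class ArchiveReplay {
        public static void main(String[] args) throws IOException {
            Path dump = Paths.get(args[0]);   // directory of files restored from S3
            String topic = args[1];           // topic to replay into

            Properties props = new Properties();
            props.put("bootstrap.servers", "restored-cluster:9092"); // placeholder
            props.put("acks", "all");
            props.put("key.serializer", StringSerializer.class.getName());
            props.put("value.serializer", StringSerializer.class.getName());

            try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
                 Stream<Path> files = Files.walk(dump)) {
                files.filter(Files::isRegularFile).forEach(file -> {
                    try (Stream<String> lines = Files.lines(file)) {
                        lines.forEach(line -> {
                            String[] kv = line.split("\t", 2);
                            String key = kv.length == 2 ? kv[0] : null;
                            String value = kv.length == 2 ? kv[1] : kv[0];
                            producer.send(new ProducerRecord<>(topic, key, value));
                        });
                    } catch (IOException e) {
                        throw new RuntimeException(e);
                    }
                });
                producer.flush();
            }
        }
    }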

Let me know if that helps!

Thanks much,
Andrew Clarkson

On Wed, Dec 21, 2016 at 9:32 PM, Stephane Maarek <steph...@simplemachines.com.au> wrote:

> Hi,
>
> I have Kafka running on EC2 in AWS.
> I would like to backup my data volumes daily in order to recover to a
> point in time in case of a disaster.
>
> One thing I’m worried about is that if I do an EBS snapshot while Kafka
> is running, it seems a Kafka that recovers on it will have to deal with
> corrupted logs (it goes through a repair / rebuild index process). It
> seems that Kafka on shutdown properly closes the logs.
>
> Questions:
> 1) If I take the EBS snapshots while Kafka is running, is it dangerous
> that a new instance launched from this backup has to go through a repair
> process?
> 2) The other option I see is to stop the Kafka broker, and then take my
> EBS snapshot. But I can’t do that for all brokers simultaneously as I
> would lose my cluster, so therefore if I do: stop kafka broker, take
> snapshot, start kafka, next broker same steps, I would get a clean
> backup, but not a point in time backup… is that an issue?
> 3) Are there any other backup strategies I haven’t considered?
>
> Thanks!
> Stephane
>
