I thought batch was dead? :-)

Yeah, I think this would be really useful. Kafka kind of lets you unify
batch and streams since you produce or consume your stream on your own
schedule, so you would want the ingress/egress to work the same way.

Ewen, rather than sleeping, I think the use case is that I want to be able
to crontab the copycat process to run hourly or daily to either push or
pull data and then quit when there is no more data. Scheduling the process
to start is easy; the challenge is how does copycat know it is done?

The sink side is a little easier since you can define the end of the stream
to be the last offset for each partition at the time the connector starts
(this is what Camus does, iirc). So at startup you check the end offset for
each partition, and a partition is complete when it reaches that offset.
When all jobs are complete, the process exits.
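
To make that concrete, here is a rough sketch of the sink-side completion
check using the plain Java consumer API rather than copycat itself; the
topic name, group id, and config values are all made up for illustration:

// Sketch: snapshot end offsets at startup, consume until every partition
// reaches its snapshot, then exit. Illustrative only, not copycat code.
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

import java.time.Duration;
import java.util.*;

public class BatchSinkSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // hypothetical
        props.put("group.id", "batch-sink-sketch");            // hypothetical
        props.put("auto.offset.reset", "earliest");
        props.put("key.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");
        props.put("value.deserializer",
                  "org.apache.kafka.common.serialization.ByteArrayDeserializer");

        try (KafkaConsumer<byte[], byte[]> consumer = new KafkaConsumer<>(props)) {
            // Assign all partitions of the topic explicitly so we control completion.
            List<TopicPartition> partitions = new ArrayList<>();
            consumer.partitionsFor("events").forEach(
                p -> partitions.add(new TopicPartition(p.topic(), p.partition())));
            consumer.assign(partitions);

            // Snapshot the end offset of each partition at startup.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

            Set<TopicPartition> remaining = new HashSet<>(partitions);
            while (!remaining.isEmpty()) {
                ConsumerRecords<byte[], byte[]> records = consumer.poll(Duration.ofSeconds(1));
                for (ConsumerRecord<byte[], byte[]> record : records) {
                    // ... hand the record to the sink (write to HDFS, etc.) ...
                }
                // A partition is "done" once its position reaches the startup snapshot.
                remaining.removeIf(tp -> consumer.position(tp) >= endOffsets.get(tp));
            }
            // All partitions reached their startup end offsets: the batch run is complete.
        }
    }
}

Copycat would presumably do the equivalent inside the sink task, but the
shape of the check is the same: snapshot the end offsets once, then exit
when every assigned partition has caught up to its snapshot.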

Not sure how the source side could work, since the offset concept is
heterogeneous across different systems.

-Jay

On Thu, Aug 13, 2015 at 10:23 PM, Gwen Shapira <g...@confluent.io> wrote:

> Hi Team Kafka,
>
> (sorry for the flood, this is last one! promise!)
>
> If you tried out PR-99, you know that CopyCat now does on-going
> export/import. So it will continuously read data from a source and write it
> to Kafka (or vice versa). This is great for tailing logs and replicating
> from a MySQL binlog.
>
> But, I'm wondering if there's a need for a batch-mode too.
> This can be useful for:
> * Camus-like thing. You can stream data to HDFS, but the benefits are
> limited and there are some known issues there.
> * Dump large parts of an RDBMS at once.
>
> Do you agree that this need exists? Or is stream export/import good enough?
>
> Also, does anyone have ideas for how they would like the batch mode to work?
>
> Gwen
>
