Thanks Jay and Ewen for the response.

>@Jay
>
> 3. This has a built in notion of parallelism throughout.



It was not obvious what this will look like, or how it will differ from
existing systems, since all of the existing ones do parallelize data movement.


@Ewen,

>Import: Flume is just one of many similar systems designed around log
>collection. See notes below, but one major point is that they generally
>don't provide any sort of guaranteed delivery semantics.


I think most of them do provide guarantees of some sort (e.g. Flume &
FluentD).


>YARN: My point isn't that YARN is bad, it's that tying to any particular
>cluster manager severely limits the applicability of the tool. The goal is
>to make Copycat agnostic to the cluster manager so it can run under Mesos,
>YARN, etc.

OK, got it. It sounds like there is a plan to do some work here to ensure
that it works out of the box with more than one scheduler (as @Jay listed).
In that case, IMO it would be better to rephrase the KIP to state
explicitly that it will support more than one scheduler.


>Exactly once: You accomplish this in any system by managing offsets in the
>destination system atomically with the data or through some kind of
>deduplication. Jiangjie actually just gave a great talk about this issue
>at
>a recent Kafka meetup, perhaps he can share some slides about it. When you
>see all the details involved, you'll see why I think it might be nice to
>have the framework help you manage the complexities of achieving different
>delivery semantics ;)


Deduplication as a post-processing step is a common recommendation today,
but that is a workaround for the delivery system's inability to provide
exactly-once. IMO such post-processing should not be considered part of
the "exactly-once" guarantee of Copycat.


It would be good to know how this guarantee will be possible when delivering
to HDFS.
It would be great if someone could share those slides if this is discussed
in them.
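
To make the question concrete: below is one way I could imagine "offsets
committed atomically with the data" working for an HDFS sink. This is only a
guess, not anything from the KIP; the class, field, and path names are made
up. The idea is simply that the offset range is encoded in the committed
filename and an atomic HDFS rename publishes the data and its offsets in one
step, so on restart the connector can recover where to resume without a
separate offset store, and a retried batch just finds its file already
committed.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

import java.io.IOException;
import java.io.OutputStream;

public class AtomicHdfsCommitSketch {
    private final FileSystem fs;
    private final Path tmpDir;
    private final Path dataDir;

    public AtomicHdfsCommitSketch(Configuration conf, String topicDir) throws IOException {
        this.fs = FileSystem.get(conf);
        this.tmpDir = new Path(topicDir, "_tmp");
        this.dataDir = new Path(topicDir, "data");
        fs.mkdirs(tmpDir);
        fs.mkdirs(dataDir);
    }

    /** Write one batch and publish it together with its offset range in a single rename. */
    public void commitBatch(int partition, long startOffset, long endOffset, byte[] records)
            throws IOException {
        String name = partition + "-" + startOffset + "-" + endOffset;
        Path committed = new Path(dataDir, name);
        if (fs.exists(committed)) {
            return;                       // batch already committed earlier -> retry is a no-op
        }
        Path tmp = new Path(tmpDir, name + ".tmp");
        try (OutputStream out = fs.create(tmp, true)) {
            out.write(records);
        }
        // Atomic step: the data file and the offset range it covers become visible together.
        if (!fs.rename(tmp, committed)) {
            throw new IOException("commit rename failed for " + committed);
        }
    }

    /** On restart, the highest committed end offset for a partition says where to resume. */
    public long lastCommittedOffset(int partition) throws IOException {
        long max = -1L;
        for (FileStatus s : fs.listStatus(dataDir)) {
            String[] parts = s.getPath().getName().split("-");
            if (parts.length == 3 && Integer.parseInt(parts[0]) == partition) {
                max = Math.max(max, Long.parseLong(parts[2]));
            }
        }
        return max;
    }
}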




Was looking for clarification on these:
- Export side - is this a MapReduce-style job or something else? If
delivering to HDFS, would this run on the Hadoop cluster or outside it?
- Import side - how does this look? Is it a bunch of Flume-like processes,
or maybe just some kind of broker that translates the incoming protocol
into the outgoing Kafka producer API? If delivering to HDFS, will this run
on the cluster or outside it? (A rough sketch of the shape I have in mind
is below.)
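
For the import side, just to make the "translates the incoming protocol into
the Kafka producer API" shape concrete, here is a trivial sketch that assumes
nothing about Copycat's actual API: it reads records from stdin (standing in
for a syslog/Flume-style input) and forwards them with the plain producer
client. The topic name and broker address are placeholders.

import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.util.Properties;

public class StdinImportSketch {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");               // placeholder broker
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props);
             BufferedReader in = new BufferedReader(new InputStreamReader(System.in))) {
            String line;
            while ((line = in.readLine()) != null) {
                // Each incoming record becomes a plain Kafka produce request.
                producer.send(new ProducerRecord<>("import-topic", line));  // placeholder topic
            }
        }
    }
}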


I still think adding one or two specific end-to-end use cases to the KIP,
showing how Copycat will pan out for them on import/export, would really
clarify things.


