I am still not convinced that a stream processing framework closely tied to
Kafka would not help with this (since we are also referring to some basic
transformations). The devil is in the details of the design, and I will be
able to comment more usefully once that is available. I would love to see a
detailed design doc on the internals!

On 6/23/15 2:59 PM, "Ewen Cheslack-Postava" <e...@confluent.io> wrote:

>There was some discussion on the KIP call today. I'll give my summary of
>what I heard here to make sure this thread has the complete context for
>ongoing discussion.
>
>* Where the project should live, and if in Kafka, where should connectors
>live? If some are in Kafka and some not, how many and which ones? - There
>was little disagreement on the tradeoffs (coding and packaging together can
>make things easier for end users, especially for a few very popular
>connectors, while maintaining connectors internally can lead to a messier
>code base with more dependencies that's harder to work with, etc.). The
>focus right now seems to be more on the location of connectors than of the
>framework; we'll probably only make progress on this issue with some
>concrete proposals.
>* Organizational issues within Kafka - subproject? - Jay mentioned a desire
>for consistency, which can be a problem even across subprojects.
>* Will streaming data be supported? - Yes, the "Streaming and batch" section
>of the design goals should cover this; this is a very important use case.
>* Additional transformations in Copycat - Updated the wiki to leave this a
>bit more open. The original motivation for leaving it out was to keep the
>scope of this KIP and the Copycat framework very clear, since there is a
>danger of overgeneralizing and ending up with a stream processing framework;
>however, it's clear there are some very useful, very common examples like
>scrubbing data during import (see the transformation sketch after this
>list).
>* Schemas and how the data model works - this requires a more in-depth
>answer when we get to a complete proposal, but the prototype we've been
>playing with internally uses something that can work with data roughly like
>Avro or JSON, and supports schemas. The goal is for this data model to be
>used only at runtime and for the serialization used for storing data in
>Kafka to be pluggable. Each type of serialization plugin might handle things
>like schemas in different ways. The reason we are proposing the inclusion of
>schemas is that it lets you cleanly carry important info across multiple
>stages, e.g. the schema for data pulled from a database is defined by the
>table the data is read from, intermediate processing steps might maintain
>schemas as well, and then an export to, e.g., a Parquet file in HDFS would
>also use the schema (see the data model sketch after this list). There will
>definitely need to be discussion about the details of this data model, what
>needs to be included to make it work across multiple serialization formats,
>etc.
>* Could mirror maker be implemented in Copycat? Same for Camus? - Yes, both
>would make sense in Copycat. One of the motivations is to have fewer tools
>required for a lot of these common tasks. Mirror maker is a case where we
>could easily maintain the connector as part of Kafka, and we could probably
>bootstrap one very quickly using lessons learned from mirror maker. The
>experience with mirror maker is also an argument for making sure Kafka devs
>are closely involved in Copycat development -- it's actually tricky to get
>right even when you know Kafka, and Copycat has to get everything right for
>the more general cases.
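
As a purely hypothetical illustration of the "scrubbing data during import"
transformation mentioned above, here is a minimal Java sketch; the Record and
Transformation types are invented for illustration and are not part of any
proposed Copycat API:

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical runtime record: field name -> value (schema omitted for brevity).
    class Record {
        final Map<String, Object> fields;

        Record(Map<String, Object> fields) {
            this.fields = fields;
        }

        // Return a copy of this record with one field removed.
        Record without(String field) {
            Map<String, Object> copy = new HashMap<>(fields);
            copy.remove(field);
            return new Record(copy);
        }
    }

    // Hypothetical per-record hook applied during import, before the record
    // is handed to the Kafka producer.
    interface Transformation {
        Record apply(Record record);
    }

    // Example: scrub a PII field so it never reaches Kafka.
    class ScrubEmailField implements Transformation {
        @Override
        public Record apply(Record record) {
            return record.without("email");
        }
    }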
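
Similarly, for the schemas/data model bullet above, a rough sketch of what a
runtime-only, schema-carrying record plus a pluggable serialization boundary
could look like; all names here (Schema, SchemaRecord, Converter) are
assumptions for illustration, not the proposed data model:

    import java.util.Map;

    // Hypothetical runtime schema: just named, typed fields.
    class Schema {
        enum Type { STRING, INT64, FLOAT64, BOOLEAN }

        final Map<String, Type> fields;

        Schema(Map<String, Type> fields) {
            this.fields = fields;
        }
    }

    // Hypothetical schema-carrying record, used only inside the framework at
    // runtime; it is never written to Kafka in this form.
    class SchemaRecord {
        final Schema schema;
        final Map<String, Object> values;

        SchemaRecord(Schema schema, Map<String, Object> values) {
            this.schema = schema;
            this.values = values;
        }
    }

    // Pluggable serialization boundary: the framework only ever sees
    // SchemaRecord, and each plugin decides how (or whether) to encode the
    // schema in the bytes stored in Kafka (e.g. Avro with a schema registry,
    // or schemaless JSON).
    interface Converter {
        byte[] fromRecord(SchemaRecord record);   // used on the import path
        SchemaRecord toRecord(byte[] kafkaValue); // used on the export path
    }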
>
>I made minor updates to the wiki to reflect some of these notes. Anyone
>else have any specific updates they think should be made to any of the
>sections, especially considerations I may have omitted from the "rejected
>alternatives" (or any "rejected" alternatives that they think still need to
>be under consideration)?
>
>Let me know what you think needs to be addressed to get this to a vote --
>I don't want to rush people, but I also don't want to just leave this
>lingering unless there are specific issues that can be addressed.
>
>-Ewen
>
>
>On Mon, Jun 22, 2015 at 8:32 PM, Roshan Naik <ros...@hortonworks.com>
>wrote:
>
>> Thanks Jay and Ewen for the response.
>>
>>
>> >@Jay
>> >
>> > 3. This has a built in notion of parallelism throughout.
>>
>>
>>
>> It was not obvious how it will look or how it will differ from existing
>> systems, since all of the existing ones do parallelize data movement.
>>
>>
>> @Ewen,
>>
>> >Import: Flume is just one of many similar systems designed around log
>> >collection. See notes below, but one major point is that they generally
>> >don't provide any sort of guaranteed delivery semantics.
>>
>>
>> I think most of them do provide guarantees of some sort (Ex. Flume &
>> FluentD).
>>
>>
>> >YARN: My point isn't that YARN is bad, it's that tying to any particular
>> >cluster manager severely limits the applicability of the tool. The goal
>> >is to make Copycat agnostic to the cluster manager so it can run under
>> >Mesos, YARN, etc.
>>
>> Ok, got it. Sounds like there is a plan to do some work here to ensure it
>> works out of the box with more than one scheduler (as @Jay listed out). In
>> that case, IMO it would be better to actually rephrase the KIP to say that
>> it will support more than one scheduler.
>>
>>
>> >Exactly once: You accomplish this in any system by managing offsets in
>> >the destination system atomically with the data or through some kind of
>> >deduplication. Jiangjie actually just gave a great talk about this issue
>> >at a recent Kafka meetup, perhaps he can share some slides about it. When
>> >you see all the details involved, you'll see why I think it might be nice
>> >to have the framework help you manage the complexities of achieving
>> >different delivery semantics ;)
>>
>>
>> Deduplication as a post-processing step is a common recommendation today,
>> but that is a workaround for the delivery system's inability to provide
>> exactly-once. IMO such post-processing should not be considered part of
>> the "exactly-once" guarantee of Copycat.
>>
>>
>> It will be good to know how this guarantee will be possible when
>> delivering to HDFS. It would be great if someone could share those slides
>> if this is discussed there.
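
On the HDFS question, here is a minimal, hypothetical Java sketch of the
"offsets stored atomically with the data" pattern Ewen mentions, under the
assumption that the sink is HDFS and that committed filenames encode the
Kafka partition and offset range; the class and method names are invented for
illustration and this is not a description of the actual Copycat design:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    // Sketch of "offsets stored atomically with the data" for an HDFS sink.
    // A batch is first written to a temp file, then renamed to a final name
    // that encodes the partition and offset range. HDFS rename is atomic, so
    // a batch and its offsets become visible together or not at all.
    class HdfsOffsetCommitter {
        private final FileSystem fs;
        private final Path finalDir;

        HdfsOffsetCommitter(Configuration conf, Path finalDir) throws IOException {
            this.fs = FileSystem.get(conf);
            this.finalDir = finalDir;
        }

        // Atomically publish a fully written temp file covering
        // [startOffset, endOffset] of the given partition.
        void commit(Path tmpFile, int partition, long startOffset, long endOffset)
                throws IOException {
            Path committed = new Path(finalDir, String.format(
                "part-%d-%d-%d.avro", partition, startOffset, endOffset));
            if (!fs.rename(tmpFile, committed)) {
                throw new IOException("Commit failed for " + committed);
            }
        }

        // On restart, recover the next offset to consume for a partition by
        // scanning the committed filenames; re-delivered records below this
        // offset are simply skipped, so no post-hoc deduplication is needed.
        long nextOffset(int partition) throws IOException {
            long next = 0L;
            for (FileStatus status : fs.listStatus(finalDir)) {
                String name = status.getPath().getName().replace(".avro", "");
                String[] parts = name.split("-"); // part-<partition>-<start>-<end>
                if (parts.length == 4 && Integer.parseInt(parts[1]) == partition) {
                    next = Math.max(next, Long.parseLong(parts[3]) + 1);
                }
            }
            return next;
        }
    }

The point of the pattern is that the offset range and the data land in one
atomic rename, so after a crash the writer resumes from the highest committed
end offset and skips re-delivered records instead of deduplicating afterwards.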
>>
>> Was looking for clarification on this:
>> - Export side - is this like a MapReduce kind of job or something else? If
>> delivering to HDFS, would this be running on the Hadoop cluster or
>> outside?
>> - Import side - how does this look? Is it a bunch of Flume-like processes?
>> Maybe just some kind of a broker that translates the incoming protocol
>> into the outgoing Kafka producer API protocol? If delivering to HDFS, will
>> this run on the cluster or outside?
>>
>>
>> I still think adding one or two specific end-to-end use cases to the KIP,
>> showing how Copycat will pan out for them for import and export, would
>> really clarify things.
>>
>
>
>-- 
>Thanks,
>Ewen
