Thanks everyone for your input on this thread; it looks like a hot topic ;) I thought I'd reply to everyone's feedback in one go rather than send lots of separate replies, so here goes...
Henry - thanks for pointing out Secor, I had never seen it before. I can see why not having a Hadoop dependency is appealing, but in our case we actually like the dependency: with Camus it means we can scale the job out on the cluster without having to do anything extra ourselves. The documentation also makes it look like Secor is very S3-centric, while we're interested in HDFS.

Guozhang - Copycat certainly looks very promising, and again I'd never come across it. An HDFS export connector that runs on YARN would probably be what we'd be looking for; it could potentially do what Camus does, and being more tightly integrated with Kafka should mean it's less likely to be orphaned. We'll certainly keep an eye on this, although it looks like it's probably not production ready yet? It also wasn't immediately clear how one would run it on YARN - our jobs are typically started on lightweight machines with limited resources, so we want to delegate as much as possible to the cluster nodes for parallelising the work, with as little setup on our part as we can get away with.

Todd - we looked at Kaboom, but we don't use Avro and we need to control the formats of the files we create on HDFS (typically ORC and SequenceFile), along with also wanting full control over the HDFS paths where the files are created. Camus has extension points that allowed us to write our own RecordWriterProvider, Partitioner and MessageDecoder, all of which we use and none of which we saw as possible in Kaboom as it currently stands. Apologies if we've overlooked something here.

Vivek - we also considered Flume/Flafka, but we're actually trying to reduce the number of technologies we're using; part of the reason for us using Kafka is to have *one* standard mechanism for getting data in and out of Hadoop, and the intention is for this to replace our existing Flume infrastructure.
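For anyone curious what those Camus extension points look like in practice, below is a rough sketch of a custom RecordWriterProvider that writes SequenceFiles. The interface and class names (RecordWriterProvider, IEtlKey, CamusWrapper) come from the camus-api module, but the exact method signatures here are from memory and may not match your Camus fork, so treat this as illustrative pseudocode rather than a drop-in implementation:

```java
// Illustrative sketch only: signatures approximate the camus-api module
// and may differ in your fork of Camus. Check camus-api before using.
public class SequenceFileRecordWriterProvider implements RecordWriterProvider {

    @Override
    public String getFilenameExtension() {
        return ".seq";
    }

    @Override
    public RecordWriter<IEtlKey, CamusWrapper> getDataRecordWriter(
            TaskAttemptContext context, String fileName, CamusWrapper data,
            FileOutputCommitter committer) throws IOException, InterruptedException {

        // Create a SequenceFile.Writer under the task's work path and wrap it
        // in a RecordWriter that serialises each Kafka message payload.
        Path file = new Path(committer.getWorkPath(), fileName + getFilenameExtension());
        final SequenceFile.Writer writer =
                SequenceFile.createWriter(/* fs, conf, file, key/value classes */);

        return new RecordWriter<IEtlKey, CamusWrapper>() {
            @Override
            public void write(IEtlKey key, CamusWrapper value) throws IOException {
                writer.append(new LongWritable(key.getTime()),
                              new Text(String.valueOf(value.getRecord())));
            }

            @Override
            public void close(TaskAttemptContext ctx) throws IOException {
                writer.close();
            }
        };
    }
}
```

A custom Partitioner and MessageDecoder plug in the same way: you name your implementation class in the Camus job properties, and the framework instantiates it on the cluster, which is exactly the "delegate the work to the cluster nodes" behaviour described above.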
I appreciate that Flume can do the job, but in terms of operational complexity we'd prefer to have fewer moving parts, and we felt Camus was less complex than adding Flume to the end of the data pipeline.

So it sounds like Camus still has features that can't easily be replicated in any of the other solutions as they currently stand. It also appears that nobody here is keen on working on an official fork of Camus, possibly since they're using or working on the alternatives above? I made a similar post on the "Camus_etl" group (https://groups.google.com/forum/#!topic/camus_etl/jUkX4zC4oF0) and some parties there indicated that they would be interested in an official Camus fork, or some way of keeping the current Camus codebase in existence with new features being added to it going forward, so we'll see where that goes.

If anyone has any other opinions or thoughts please let me know.

Thanks,

Adrian

________________________________________
From: vivek thakre <vivek.tha...@gmail.com>
Sent: 22 October 2015 23:44
To: users@kafka.apache.org
Subject: Re: future of Camus?

We are using Apache Flume as a router to consume data from Kafka and push to HDFS. With Flume 1.6, Kafka Channel, Source and Sink are available out of the box.

Here is the blog post from Cloudera:
http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/

Thanks,
Vivek Thakre

On Thu, Oct 22, 2015 at 2:29 PM, Hawin Jiang <hawin.ji...@gmail.com> wrote:

> Very useful information for us.
> Thanks Guozhang.
>
> On Oct 22, 2015 2:02 PM, "Guozhang Wang" <wangg...@gmail.com> wrote:
>
> > Hi Adrian,
> >
> > Another alternative approach is to use Kafka's own Copycat framework for
> > data ingressing / egressing. It will be released in our 0.9.0 version,
> > expected in Nov.
> >
> > Under Copycat users can write different "connectors" instantiated for
> > different source / sink systems, and for your case there is an in-built
> > HDFS connector coming along with the framework itself.
> > You can find more details in these Kafka wikis / java docs:
> >
> > https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767
> > https://s3-us-west-2.amazonaws.com/confluent-files/copycat-docs-wip/intro.html
> >
> > Guozhang
> >
> > On Thu, Oct 22, 2015 at 12:52 PM, Henry Cai <h...@pinterest.com.invalid> wrote:
> >
> > > Take a look at secor:
> > >
> > > https://github.com/pinterest/secor
> > >
> > > Secor is a no-frills Kafka->HDFS ingestion tool. It doesn't depend on any
> > > underlying systems such as Hadoop; it only uses the Kafka high level
> > > consumer to balance the work loads. Very easy to understand and manage.
> > > It's probably the 2nd most popular Kafka/HDFS ingestion tool (behind
> > > Camus). Lots of web companies use this to do the Kafka data ingestion
> > > (Pinterest/Uber/AirBnb).
> > >
> > > On Thu, Oct 22, 2015 at 3:56 AM, Adrian Woodhead <awoodh...@hotels.com> wrote:
> > >
> > > > Hello all,
> > > >
> > > > We're looking at options for getting data from Kafka onto HDFS and
> > > > Camus looks like the natural choice for this. It's also evident that
> > > > LinkedIn, who originally created Camus, are taking things in a
> > > > different direction and are advising people to use their Gobblin ETL
> > > > framework instead. We feel that Gobblin is overkill for many simple
> > > > use cases and Camus seems a much simpler and better fit. The problem
> > > > now is that, with LinkedIn apparently withdrawing official support for
> > > > it, it appears that any changes to Camus are being managed by various
> > > > forks of it and it looks like everyone is building and using their own
> > > > versions. Wouldn't it be better for a community to form around one
> > > > official fork so development efforts can be focused on this? Any
> > > > thoughts on this?
> > > >
> > > > Thanks,
> > > >
> > > > Adrian
> >
> > --
> > -- Guozhang