Re: future of Camus?

2015-10-23 Thread Adrian Woodhead
Thanks everyone for your input on this thread, looks like a hot topic ;)

I thought I'd reply to everyone's feedback in one go rather than have lots of 
separate replies, so here goes...

Henry - thanks for pointing out Secor, I had never seen it before. I can see 
why not having a Hadoop dependency can be appealing but in our case we actually 
like the dependency as for Camus it means we can scale the job out on the 
cluster without having to do anything extra ourselves. The documentation also 
makes it look like Secor is very S3-centric while we're interested in HDFS.

Guozhang - Copycat certainly looks very promising, and again I'd never come 
across it. An HDFS export connector that runs on YARN would probably be what 
we're looking for and could potentially do what Camus does; being more 
tightly integrated with Kafka should also mean it's less likely to be orphaned. 
We'll certainly keep an eye on this although it looks like it's probably not 
production ready yet? It also wasn't immediately clear how one would use it to 
run on YARN - our jobs are typically started on lightweight machines which have 
limited resources so we want to delegate as much as possible to the cluster 
nodes for parallelising the work with as little setup on our part as we can get 
away with.

Todd - we looked at Kaboom but we don't use Avro and need to control the 
formats of the files we create on HDFS (typically ORC and SequenceFile). We 
also want full control over the HDFS paths where the files are created. 
Camus has extension points that allowed us to write our own 
RecordWriterProvider, Partitioner and MessageDecoder all of which we use and 
none of which we saw as possible in Kaboom as it currently stands. Apologies if 
we've overlooked something here.
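
For illustration only, the kind of path control described above could look 
like the following Python sketch. It is not Camus code and does not use the 
Camus Partitioner API; the function name `output_path`, the base directory and 
the hourly topic/yyyy/MM/dd/HH layout are all assumptions made for the example.

```python
from datetime import datetime, timezone


def output_path(base_dir, topic, partition, timestamp_ms):
    """Build an hourly-bucketed output path for one Kafka message.

    Hypothetical helper: the directory layout is an assumption, not
    something Camus, Kaboom or Secor prescribes.
    """
    # Kafka-style timestamps are milliseconds since the epoch; use UTC so
    # the bucket does not depend on the local time zone of the machine.
    ts = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return "{}/{}/{:%Y/%m/%d/%H}/part-{:05d}".format(
        base_dir, topic, ts, partition
    )


if __name__ == "__main__":
    # 1445486160000 ms is 2015-10-22 03:56 UTC.
    print(output_path("/data/kafka", "clicks", 7, 1445486160000))
```

A message from partition 7 of a topic called "clicks" would land under 
/data/kafka/clicks/2015/10/22/03/ in this layout.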

Vivek - we also considered Flume/Flafka but we're actually trying to reduce the 
number of technologies we're using and part of the reason for us using Kafka is 
to have *one* standard mechanism for getting data in and out of Hadoop and the 
intention is for this to replace our existing Flume infrastructure. I appreciate 
that Flume can do the job but in terms of operational complexity we'd prefer to 
have fewer moving parts and we felt Camus was less complex than adding Flume to 
the end of the data pipeline.

So it sounds like Camus still has features that can't easily be replicated in 
any of the other solutions as they currently stand. It also appears that nobody 
here is keen on working on an official fork of Camus, possibly since they're 
using or working on the alternatives above? I made a similar post on the 
"Camus_etl" group 
(https://groups.google.com/forum/#!topic/camus_etl/jUkX4zC4oF0) and some 
parties there indicated that they would be interested in an official Camus fork 
or some way of keeping the current Camus codebase in existence with new 
features being added to it going forward so we'll see where that goes.

If anyone has any other opinions or thoughts please let me know. 

Thanks,

Adrian



Re: future of Camus?

2015-10-22 Thread Henry Cai
Take a look at secor:

https://github.com/pinterest/secor

Secor is a no-frills Kafka->HDFS ingestion tool. It doesn't depend on any
underlying systems such as Hadoop; it only uses the Kafka high-level consumer
to balance the workload.  Very easy to understand and manage.  It's
probably the 2nd most popular Kafka/HDFS ingestion tool (behind Camus).
Lots of web companies use it for Kafka data ingestion
(Pinterest/Uber/AirBnb).
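
As a rough illustration of how a consumer group can balance work, the sketch
below divides a topic's partitions among consumers in contiguous ranges. It is
a simplified model written for this example, not Secor or Kafka code;
`assign_partitions` and the consumer names are hypothetical.

```python
def assign_partitions(partitions, consumers):
    """Split partitions among consumers in contiguous ranges.

    Simplified, hypothetical model of range-style assignment: every
    consumer gets len(partitions) // len(consumers) partitions, and the
    first few consumers (in sorted order) take one extra each.
    """
    consumers = sorted(consumers)
    partitions = sorted(partitions)
    per_consumer, extra = divmod(len(partitions), len(consumers))

    assignment = {}
    start = 0
    for index, consumer in enumerate(consumers):
        count = per_consumer + (1 if index < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


if __name__ == "__main__":
    # Five partitions shared by two consumers.
    print(assign_partitions(range(5), ["c2", "c1"]))
```

With five partitions and two consumers, one consumer takes three partitions
and the other takes two; adding a consumer and re-running the assignment
spreads the same partitions more thinly.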




future of Camus?

2015-10-22 Thread Adrian Woodhead
Hello all,

We're looking at options for getting data from Kafka onto HDFS and Camus looks 
like the natural choice for this. It's also evident that LinkedIn, who 
originally created Camus, are taking things in a different direction and are 
advising people to use their Gobblin ETL framework instead. We feel that 
Gobblin is overkill for many simple use cases and Camus seems a much simpler 
and better fit. The problem now is that, with LinkedIn apparently withdrawing 
official support for it, changes to Camus appear to be managed in various 
forks, and it looks like everyone is building and using their own versions. 
Wouldn't it be better for a community to form around one official 
fork so development efforts can be focused on this? Any thoughts on this?

Thanks,

Adrian



Re: future of Camus?

2015-10-22 Thread Todd Snyder
Another alternative is to check out Kaboom

  https://github.com/blackberry/KaBoom

It uses a pared-down Kafka consumer library to pull data from Kafka and write 
it to defined (and somewhat dynamic) HDFS paths in a custom (and changeable) 
Avro schema we call boom. It uses Kerberos for authentication, and supports 
very high throughput.

It's still actively being developed, with a new release coming soon with 
enhanced configuration through a new REST API (kontroller).

Cheers

Todd.





Re: future of Camus?

2015-10-22 Thread Guozhang Wang
Hi Adrian,

Another alternative approach is to use Kafka's own Copycat framework for
data ingressing / egressing. It will be released in our 0.9.0 version
expected in Nov.

Under Copycat, users can write different "connectors" instantiated for
different source / sink systems; for your case there is an in-built
HDFS connector coming along with the framework itself. You can find more
details in these Kafka wikis / java docs:

https://cwiki.apache.org/confluence/pages/viewpage.action?pageId=58851767

https://s3-us-west-2.amazonaws.com/confluent-files/copycat-docs-wip/intro.html

Guozhang




Re: future of Camus?

2015-10-22 Thread vivek thakre
We are using Apache Flume as a router to consume data from Kafka and push
to HDFS.
With Flume 1.6, Kafka Channel, Source and Sink are available out of the box.

Here is the blog post from Cloudera
http://blog.cloudera.com/blog/2014/11/flafka-apache-flume-meets-apache-kafka-for-event-processing/

Thanks,

Vivek Thakre





Re: future of Camus?

2015-10-22 Thread Hawin Jiang
Very useful information for us.
Thanks Guozhang.