A more scalable Kafka to Hadoop InputFormat

Casey Green Thu, 30 Oct 2014 07:43:13 -0700

Hi Folks,

I'm open sourcing a scalable Kafka InputFormat.  As far as I know, it differs 
from the other Kafka InputFormats out there in that input splits are mapped to 
individual Kafka log files rather than to entire Kafka partitions.  Mapping log 
files to input splits scales your Map/Reduce job with the amount of data left 
to consume in a queue, whereas mapping input splits to entire partitions always 
gives you a constant number of input splits.
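
To make the difference concrete, here is a rough Python sketch of the two 
split strategies.  It is not the code from the project; the names (Split, 
splits_per_partition, splits_per_log_file) and the sample offsets are made up 
for illustration.  It assumes only that each partition's log is stored as 
segment files identified by their first offset, and that the consumer has a 
last committed offset per partition.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Split:
    """Hypothetical input split: a half-open offset range in one partition."""
    partition: int
    start_offset: int  # inclusive
    end_offset: int    # exclusive


def splits_per_partition(log_end: Dict[int, int],
                         committed: Dict[int, int]) -> List[Split]:
    """One split per partition, no matter how much data is pending."""
    return [Split(p, committed.get(p, 0), end)
            for p, end in sorted(log_end.items())]


def splits_per_log_file(segment_starts: Dict[int, List[int]],
                        log_end: Dict[int, int],
                        committed: Dict[int, int]) -> List[Split]:
    """One split per log file that still holds unconsumed messages."""
    splits = []
    for p in sorted(segment_starts):
        # Each log file covers [its first offset, next file's first offset);
        # the newest file runs up to the end of the partition's log.
        bounds = sorted(segment_starts[p]) + [log_end[p]]
        resume = committed.get(p, 0)
        for lo, hi in zip(bounds, bounds[1:]):
            if hi <= resume:
                continue  # this file was fully consumed already
            lo = max(lo, resume)  # skip the consumed prefix of a partial file
            if lo < hi:
                splits.append(Split(p, lo, hi))
    return splits


# Illustrative data only: two partitions, five log files in total.
SEGMENT_STARTS = {0: [0, 1000, 2000], 1: [0, 1500]}
LOG_END = {0: 2500, 1: 1800}
COMMITTED = {0: 1200, 1: 0}

if __name__ == "__main__":
    print(len(splits_per_partition(LOG_END, COMMITTED)))
    print(len(splits_per_log_file(SEGMENT_STARTS, LOG_END, COMMITTED)))
```

With this sample data the per-partition strategy yields 2 splits, while the 
per-log-file strategy yields 4, and that number grows as more unconsumed log 
files accumulate.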


I wrote up a blog post about it here:
<http://www.conductor.com/nightlight/data-stream-processing-bulk-kafka-hadoop/>

The source code for my KafkaInputFormat is on GitHub:
<https://github.com/Conductor/kangaroo>

Your questions, comments and feedback are welcome and much appreciated!

Thanks,
Casey Green
