Re: Cross-Data-Center Mirroring, and Guaranteed Minimum Time Period on Data

2014-10-16 Thread Andrew Otto
Check out Camus. It was built to do parallel loads from Kafka into time-bucketed directories in HDFS.

On Oct 16, 2014, at 9:32 AM, Gwen Shapira wrote:
> I assume the messages themselves contain the timestamp?
>
> If you use Flume, you can configure a Kafka source to pull data from
> Kafka,

Re: Cross-Data-Center Mirroring, and Guaranteed Minimum Time Period on Data

2014-10-16 Thread Gwen Shapira
I assume the messages themselves contain the timestamp?

If you use Flume, you can configure a Kafka source to pull data from Kafka, use an interceptor to pull the date out of your message and place it in the event header, and then have the HDFS sink write to a partition based on that timestamp.

Gwen
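The flow described above (read the timestamp from the message, copy it into the event header, let the sink choose a partition from that header) can be sketched in plain Python. This is only an illustration of the idea, not Flume code or configuration. The JSON message format, the `ts` field name, the `/data/events` base path, the hourly UTC layout, and both function names are assumptions made for the example. The one detail taken from Flume is that its HDFS sink resolves time-based path escapes from a `timestamp` header holding epoch milliseconds.

```python
import json
from datetime import datetime, timezone


def add_timestamp_header(event, field="ts"):
    """Play the role of the interceptor: copy the message's own timestamp
    (assumed here to be epoch milliseconds in a JSON field) into the headers."""
    payload = json.loads(event["body"])
    headers = dict(event.get("headers", {}))
    headers["timestamp"] = str(int(payload[field]))
    return {"headers": headers, "body": event["body"]}


def partition_path(event, base="/data/events"):
    """Play the role of the sink: derive an hourly directory from the header.
    The base path and the UTC hourly layout are hypothetical choices."""
    millis = int(event["headers"]["timestamp"])
    when = datetime.fromtimestamp(millis / 1000, tz=timezone.utc)
    return base + "/" + when.strftime("%Y/%m/%d/%H")


if __name__ == "__main__":
    raw = {"headers": {}, "body": json.dumps({"ts": 1413451920000, "msg": "hello"})}
    print(partition_path(add_timestamp_header(raw)))
```

Because the partition comes from the event time carried in the message rather than from the arrival time, data that is mirrored late from a remote data center still lands in the directory for the hour it was produced.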

Re: Cross-Data-Center Mirroring, and Guaranteed Minimum Time Period on Data

2014-10-15 Thread Jun Rao
One way you can do that is to continually load data from Kafka to Hadoop. During load, you put data into different HDFS directories based on the timestamp. The Hadoop admin can decide when to open up those directories for read based on whether data from all data centers has arrived.

Thanks,
Jun
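The "open a directory once every data center has arrived" decision can be sketched as a small Python check. How arrival is actually detected is not specified in this thread. The sketch assumes some bookkeeping already records, per time bucket, which data centers have delivered. The function name, the bucket naming, and the data center names are all hypothetical.

```python
def ready_buckets(arrivals, expected_centers):
    """Return, in sorted order, the time buckets for which every expected
    data center has delivered its data.

    arrivals maps a bucket directory name to the collection of data centers
    whose data has been loaded into it (an assumed bookkeeping structure).
    """
    expected = set(expected_centers)
    return sorted(
        bucket for bucket, seen in arrivals.items() if expected <= set(seen)
    )


if __name__ == "__main__":
    arrivals = {
        "2014/10/15/00": {"dc-east", "dc-west"},
        "2014/10/15/01": {"dc-east"},  # dc-west has not arrived yet
    }
    print(ready_buckets(arrivals, ["dc-east", "dc-west"]))
```

Only the buckets returned here would be opened for reads. The others stay closed until the missing data centers catch up.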

Cross-Data-Center Mirroring, and Guaranteed Minimum Time Period on Data

2014-10-14 Thread Alex Melville
Hi Apache Community,

My company has the following use case. We have multiple geographically dispersed data centers, each with its own Kafka cluster, and we want to use MirrorMaker to aggregate all of these centers' data into one central Kafka cluster located in a data center distinct from the rest.