I am using Cloudera's Flume example to consume the Twitter sample stream and 
store it on HDFS. 

https://github.com/cloudera/cdh-twitter-example/blob/master/flume-sources/src/main/java/com/cloudera/flume/source/TwitterSource.java

The source puts the time a tweet was created into a header (“timestamp”) on 
the Flume event. 
From working with the Twitter stream earlier, I noticed that there is usually 
some lag between the time a tweet was created and the time I receive it.

My goal is to partition tweets by the time they were created. 

Will this timestamp header take care of this?
(using this configuration: 
TwitterAgent.sinks.HDFS.type = hdfs
TwitterAgent.sinks.HDFS.hdfs.path = hdfs://hadoop1:8020/user/flume/tweets/%Y/%m/%d/%H/
)
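
To make the goal concrete, here is a minimal Python sketch of the bucketing I am after. It is not Flume code and not a claim about how the HDFS sink works internally. The helper name hourly_path, the example dates, and the use of UTC are my own assumptions for illustration; the only things taken from my setup are the base path and the idea that the header carries the creation time in epoch milliseconds.

```python
from datetime import datetime, timezone

# Base path copied from the sink configuration above.
BASE = "hdfs://hadoop1:8020/user/flume/tweets"


def hourly_path(base: str, timestamp_ms: int) -> str:
    """Expand %Y/%m/%d/%H from a creation time in epoch milliseconds.

    Hypothetical helper: UTC is assumed here purely for illustration.
    """
    created = datetime.fromtimestamp(timestamp_ms // 1000, tz=timezone.utc)
    return f"{base}/{created:%Y/%m/%d/%H}/"


# A tweet created at 11:58:30 that only arrives at 12:01:00.
created_ms = int(datetime(2013, 1, 1, 11, 58, 30, tzinfo=timezone.utc).timestamp() * 1000)
received_ms = int(datetime(2013, 1, 1, 12, 1, 0, tzinfo=timezone.utc).timestamp() * 1000)

# Bucketing by creation time should put it in hour 11,
# while bucketing by arrival time would put it in hour 12.
print(hourly_path(BASE, created_ms))
print(hourly_path(BASE, received_ms))
```

The behaviour I want is the first path: the late tweet still lands in the directory for the hour in which it was created.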

What happens (in an extreme case) when a tweet arrives a couple of minutes 
late? Will Flume reopen a file from the past hour and append the tweet there? 
If not, how can I achieve proper partitioning without overlaps between time 
slices (hours)?


Best
Dominik
