Well, we almost support #1 already: pass a "timestamp" header at the first hop, then use the BucketPath escape-sequence shorthand to build the hdfs.path according to that spec (except for the agent-hostname part).
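To make that concrete, here is a minimal sketch of the kind of sink configuration I mean. The agent and sink names ("collector", "hdfsSink") and the NameNode URI are hypothetical; the date escape sequences resolve against the event's "timestamp" header:

```
# Hypothetical collector agent config (sketch, not a drop-in file).
# The %Y/%m/%d/%H/%M escapes are expanded by the HDFS sink from the
# "timestamp" header set at the first hop.
collector.sinks.hdfsSink.type = hdfs
collector.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume-data/%Y/%m/%d/%H/%M
```

Events arriving without a "timestamp" header would fail bucketing, so the first-hop source has to set it reliably.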
With #2 it seems reasonable to add support for an arbitrary "tag" header or something like that, which one could use in the hdfs.path as well. But it would have to come from the first-hop agent at this point. The tag could take the place of the hostname.

Something that might get Flume closer to the vision below without hacking the core is adding support for a plugin interface to AvroSource which can annotate headers. However, I worry that people might take this and try to do all kinds of parsing and whatnot, so I think the first cut should only support reading and setting headers. This is basically a "routing" feature, which I would argue Flume needs to be good at and flexible about.

Just in case I misinterpreted the use case, I want to make sure we are not trying to have multiple HDFSEventSink agents append to the same HDFS file simultaneously, since I am pretty sure Hadoop doesn't support that. Inder, just to clarify, is this what you are doing?

(N) event-generating agents (Custom Source + AvroSink) =>
(M < N) collector agents (AvroSource + AvroSink) =>
Load-Balancing VIP => (AvroSource + HDFSEventSink) => HDFS

Best,
Mike

On Apr 11, 2012, at 9:55 AM, Hari Shreedharan wrote:

> Hi Inder,
>
> I think these use cases are quite specific to your requirements. Even though
> I did not clearly understand (2), I think that can be addressed through
> configuration, and you would not need to add any new code for that. I don't
> understand why you would want to inject a header in that case. You can simply
> have different configurations for each of the agents, with different sink
> paths. So agent A would have a sink configured to write to
> /flume-data/agenta/… and so on.
>
> I don't think we have support for something like (1) as of now. It does not
> look like something which is very generic, and I have not heard of anyone
> else having such a requirement.
> If you want this, the only way I can see is to pick up AvroSource and add
> this support, and make it configurable (an on/off switch in the conf).
>
> Thanks,
> Hari
>
> --
> Hari Shreedharan
>
>
> On Wednesday, April 11, 2012 at 4:26 AM, Inder Pall wrote:
>
>> Folks,
>>
>> I have two use cases, and both seem to land under this requirement:
>>
>> 1. Support publishing files in HDFS under /flume-data/YYYY/MM/DD/HH/MN,
>> where the timestamp is the arrival time on this agent.
>> This can be addressed by passing a "timestamp" header in the event's
>> headers. The caveat is that I want to pass this header at the final
>> agent in the pipeline.
>>
>> 2. Have multiple Flume agents configured behind a VIP writing to the
>> same HDFS sink path.
>> One way is to have a path like
>> /flume-data/<flume-agent-hostname>/YYYY/MM/DD/HH/MN.
>> Again, this can be addressed by passing a "hostname" header at the Flume
>> agent and configuring the sink path appropriately.
>>
>> Would appreciate any help on how to address this in a generic way in
>> Flume. It seems to be a generic use case for anyone planning to take
>> Flume to production.
>>
>> --
>> Thanks,
>> - Inder
>> Tech Platforms @Inmobi
>> LinkedIn - http://goo.gl/eR4Ub
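For reference, the host-in-path layout Inder sketches in (2) can be expressed with header interpolation in the sink path, assuming the first-hop agent sets a "hostname" header on each event (at this point that header would have to come from a custom source, as discussed above). Agent and sink names here are hypothetical:

```
# Sketch for use case 2 (hypothetical names): the terminal agent's HDFS
# sink substitutes the event's "hostname" header via %{hostname}, giving
# each sending host its own directory and avoiding concurrent appends to
# one HDFS file.
collector.sinks.hdfsSink.type = hdfs
collector.sinks.hdfsSink.hdfs.path = hdfs://namenode/flume-data/%{hostname}/%Y/%m/%d/%H/%M
```

Since each host writes under its own prefix, no two HDFSEventSink agents ever append to the same file, which sidesteps the Hadoop limitation Mike raises.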
