Well, we almost support #1, although the way to do it is pass a "timestamp" 
header at the first hop. Then you can use the BucketPath shorthand stuff to 
name the hdfs.path according to this spec (except for the agent-hostname thing).

With #2 it seems reasonable to add support for an arbitrary "tag" header or 
something like that which one could use in the hdfs.path as well. But it would 
have to come from the first-hop agent at this point. The tag could take the 
place of the hostname.
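Assuming the first hop sets a header named "tag", the %{header} escape could then pull it into the path, something like (sketch only, names made up):

```
# %{tag} substitutes the value of the event's "tag" header,
# taking the place of a hostname component in the path.
agent1.sinks.hdfsSink.hdfs.path = /flume-data/%{tag}/%Y/%m/%d/%H/%M
```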

Something that might get Flume closer to the below vision without hacking the 
core is adding support for a plugin interface to AvroSource which can annotate 
headers. However I worry that people might take this and try to do all kinds of 
parsing and whatnot. So I think the first cut should only support reading & 
setting headers. This is basically a "routing" feature which I would argue 
Flume needs to be good at and flexible for.
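To make the idea concrete, here is a minimal sketch of what such a plugin hook might look like. None of this is real Flume API; the interface and method names are illustrative only, and per the "first cut" restriction it only reads and sets headers, with no payload parsing:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a header-annotating plugin for AvroSource.
// Names here (HeaderAnnotator, annotate) are made up, not Flume API.
public class HeaderAnnotatorSketch {

    // First cut: the hook may only read and set headers -- no body parsing.
    interface HeaderAnnotator {
        void annotate(Map<String, String> headers);
    }

    public static void main(String[] args) {
        // Example annotator: stamp arrival time if the first hop did not.
        HeaderAnnotator stampTime = headers ->
                headers.putIfAbsent("timestamp",
                        Long.toString(System.currentTimeMillis()));

        Map<String, String> headers = new HashMap<>();
        stampTime.annotate(headers);
        System.out.println(headers.containsKey("timestamp"));
    }
}
```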

Just in case I misinterpreted the use case, I want to make sure we are not 
trying to have multiple HDFSEventSink agents append to the same HDFS file 
simultaneously, since I am pretty sure Hadoop doesn't support that.

Inder, just to clarify, is this what you are doing?

(N) event-generating agents (Custom Source + AvroSink) => (M < N) collector 
agents (AvroSource + AvroSink) => Load-Balancing VIP => (AvroSource + 
HDFSEventSink) => HDFS

Best,
Mike

On Apr 11, 2012, at 9:55 AM, Hari Shreedharan wrote:

> Hi Inder,  
> 
> I think these use cases are quite specific to your requirements. Even though 
> I did not clearly understand (2), I think that can be addressed through 
> configuration, and you would not need to add any new code for that. I don't 
> understand why you would want to inject a header in that case. You can simply 
> have different configurations for each of the agents, with different sink 
> paths. So agent A would have a sink configured to write to 
> /flume-data/agenta/.… and so on.  
> 
> I don't think we have support for something like (1) as of now. It does not 
> look like something which is very generic, and I have not heard of anyone else 
> having such a requirement. If you want this, the only way I can see is to 
> pick up AvroSource and add this support, and make it configurable (an on/off 
> switch in the conf).
> 
> Thanks
> Hari
> 
> --  
> Hari Shreedharan
> 
> 
> On Wednesday, April 11, 2012 at 4:26 AM, Inder Pall wrote:
> 
>> Folks,
>> 
>> i have two use-cases and both seem to be landing under this requirement
>> 
>> 1. Support to publish files in HDFS in /flume-data/YYYY/MM/DD/HH/MN.
>> Timestamp is the arrival time on this agent.
>>>> Can be addressed by passing "timestamp" in HEADERS of event. Caveat is I
>>>> want to pass this header at the final agent in the pipeline.
>> 2. Have multiple flume agents configured behind a VIP writing to the same
>> HDFS sink path.
>>>> One of the ways is to have the path like -
>>>> /flume-data/<flume-agent-hostname>/YYYY/MM/DD/HH/MN
>> Again, this can be addressed by passing a header "hostname" at the flume agent
>> and configuring the sink path appropriately.
>> 
>> Would appreciate any help on how to address this in a generic way in FLUME.
>> Seems to be a generic use-case for anyone planning to take FLUME to
>> production.
>> 
>> --  
>> Thanks,
>> - Inder
>> Tech Platforms @Inmobi
>> Linkedin - http://goo.gl/eR4Ub
>> 
>> 
> 
> 
