Hi Inder,

Can you briefly summarize what you want to do and what is missing from Flume for you to do it?

Seems like you could store in a structure like this with static configs:

flume-data/YYYY/mm/DD/HH/MM/<streamName>.<collectorName>.<filename>.gz

As you mentioned, you would have to use one HDFS sink per stream/collector pair to define this statically in the config files.

Is the problem that you want events to be strictly contained in a log file named according to their internal timestamp? Is it not acceptable to go by event delivery time at the agent?

Best,
Mike

On Apr 12, 2012, at 12:33 AM, Inder Pall wrote:

> Mike and Hari,
>
> Appreciate your prompt and detailed responses.
>
> 1. For timestamp header at agent - OOZIE based consumer workflows wait
> for data in a directory structure like /flume-data/YYYY/MM/DD/HH/MN/. We
> have a *contract* - a minute level directory can be consumed (is *immutable*)
> once the next minute directory is available. If the "timestamp" injected by
> clientLib is used, it's difficult to guarantee this contract (messages
> coming late, clocks not synchronized, etc).
>
> Mike, specifically the configuration I was planning for is
> ((N) ClientLib => (N) event-generating agents (Avro Source + AvroSink))
> => (M < N) collector agents (AvroSource + HDFSEventSink)
>
> 2. I agree that the agentPath use-case can be supported without headers
> through a separate HDFSEventSink configuration. This will ensure different
> agents write to different paths (thereby avoiding any critical
> section issues).
>
> Reason for asking was to avoid directory structure like -
> /flume-data/<collector1>/YYYY/MM/DD/HH/MN
> /flume-data/<collector1>/YYYY/MM/DD/HH/MN
> ......................................
> ......................................
> /flume-data/<collector1>/YYYY/MM/DD/HH/MN
> and instead have
> /flume-data/YYYY/MM/DD/HH/MN/<collectorName>-<streamName>-<FileName>.gz
> (Adding the collectorName avoids the issue of multiple folks writing to
> the same file)
>
> However it's tough to obey the above mentioned contract - collector1 has
> moved forward to a new directory and collector2 is still writing to the old
> minute directory. Just wanted to avoid the additional hop of moving data
> from collector specific directories to one unified location, though I can
> live with it.
>
> I don't want to do something specific here and end up maintaining a different
> version of FLUME :(
> Let me know what you guys think. I believe as the adoption grows so will
> use-cases which require adding/modifying headers at avroSource.
>
> Looking forward to hearing from you folks
>
> - Inder
>
> On Wed, Apr 11, 2012 at 11:43 PM, Mike Percy wrote:
>
>> Well, we almost support #1, although the way to do it is to pass a
>> "timestamp" header at the first hop. Then you can use the BucketPath
>> shorthand stuff to name the hdfs.path according to this spec (except for
>> the agent-hostname thing).
>>
>> With #2 it seems reasonable to add support for an arbitrary "tag" header
>> or something like that which one could use in the hdfs.path as well. But it
>> would have to come from the first-hop agent at this point. The tag could
>> take the place of the hostname.
>>
>> Something that might get Flume closer to the below vision without hacking
>> the core is adding support for a plugin interface to AvroSource which can
>> annotate headers. However I worry that people might take this and try to do
>> all kinds of parsing and whatnot. So I think the first cut should only
>> support reading & setting headers. This is basically a "routing" feature
>> which I would argue Flume needs to be good at and flexible for.
>>
>> Just in case I misinterpreted the use case, I want to make sure we are not
>> trying to have multiple HDFSEventSink agents append to the same HDFS file
>> simultaneously, since I am pretty sure Hadoop doesn't support that.
>>
>> Inder, just to clarify, is this what you are doing?
>>
>> (N) event-generating agents (Custom Source + AvroSink) => (M < N)
>> collector agents (AvroSource + AvroSink) => Load-Balancing VIP =>
>> (AvroSource + HDFSEventSink) => HDFS
>>
>> Best,
>> Mike
>>
>> On Apr 11, 2012, at 9:55 AM, Hari Shreedharan wrote:
>>
>>> Hi Inder,
>>>
>>> I think these use cases are quite specific to your requirements. Even
>>> though I did not clearly understand (2), I think that can be addressed
>>> through configuration, and you would not need to add any new code for that.
>>> I don't understand why you would want to inject a header in that case. You
>>> can simply have different configurations for each of the agents, with
>>> different sink paths. So agent A would have a sink configured to write to
>>> /flume-data/agenta/.… and so on.
>>>
>>> I don't think we have support for something like (1) as of now. It does
>>> not look like something which is very generic, and I have not heard of
>>> someone else having such a requirement. If you want this, the only way I
>>> can see is to pick up AvroSource, add this support, and make it
>>> configurable (on/off switch in the conf).
>>>
>>> Thanks
>>> Hari
>>>
>>> --
>>> Hari Shreedharan
>>>
>>> On Wednesday, April 11, 2012 at 4:26 AM, Inder Pall wrote:
>>>
>>>> Folks,
>>>>
>>>> I have two use-cases and both seem to be landing under this requirement
>>>>
>>>> 1. Support to publish files in HDFS in /flume-data/YYYY/MM/DD/HH/MN.
>>>> Timestamp is the arrival time on this agent.
>>>>    Can be addressed by passing "timestamp" in HEADERS of event. Caveat is I
>>>>    want to pass this header at the final agent in pipeline.
>>>> 2. Have multiple flume agents configured behind a VIP writing to the same
>>>> HDFS sink path.
>>>>    One of the ways is to have the path like -
>>>>    /flume-data/<flume-agent-hostname>/YYYY/MM/DD/HH/MN
>>>> Again can be addressed by passing a header "hostname" at flume agent and
>>>> configuring the sink path appropriately.
>>>>
>>>> Would appreciate any help on how to address this in a generic way in FLUME.
>>>> Seems to be a generic use-case for anyone planning to take FLUME to
>>>> production.
>>>>
>>>> --
>>>> Thanks,
>>>> - Inder
>>>> Tech Platforms @Inmobi
>>>> Linkedin - http://goo.gl/eR4Ub
>
> --
> Thanks,
> - Inder
> Tech Platforms @Inmobi
> Linkedin - http://goo.gl/eR4Ub
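
For anyone following the thread, here is a rough Python sketch of the two ideas discussed: the per-minute layout flume-data/YYYY/mm/DD/HH/MM/<streamName>.<collectorName>.<filename>.gz and the consumer contract that a minute directory is treated as immutable once a later minute directory exists. This is not Flume code. The function names, the "/flume-data" root default, and the reading of "next minute directory" as "any later minute" are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def minute_of(ts):
    """Truncate a datetime to its minute bucket."""
    return ts.replace(second=0, microsecond=0)


def bucket_path(ts, stream, collector, filename, root="/flume-data"):
    """Build <root>/YYYY/MM/DD/HH/MN/<stream>.<collector>.<filename>.gz.

    Hypothetical helper: the root and naming scheme follow the layout
    proposed in the thread, not any Flume API.
    """
    return f"{root}/{ts:%Y/%m/%d/%H/%M}/{stream}.{collector}.{filename}.gz"


def is_consumable(minute, present_minutes):
    """Contract check (assumed reading): a minute bucket may be consumed
    once at least one later minute bucket is present."""
    next_minute = minute_of(minute) + timedelta(minutes=1)
    return any(minute_of(m) >= next_minute for m in present_minutes)


if __name__ == "__main__":
    arrival = datetime(2012, 4, 12, 0, 33, 17)
    print(bucket_path(arrival, "streamA", "collector1", "events"))

    present = [datetime(2012, 4, 12, 0, 33), datetime(2012, 4, 12, 0, 34)]
    print(is_consumable(datetime(2012, 4, 12, 0, 33), present))
    print(is_consumable(datetime(2012, 4, 12, 0, 34), present))
```

Note that with several collectors writing into the same minute directory, this check alone does not guarantee immutability: as Inder points out, one collector can move on to the next minute while another is still writing to the old one.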
