How about supporting something like "host2.sources.src1.header.timestamp=true" as a config option? This would override the timestamp header on host2->src1 (Avro source) for all events.
- inder

On Fri, Apr 13, 2012 at 2:29 PM, Mike Percy <[email protected]> wrote:

> Funny, I thought it was doing the opposite thing. :)
>
> It should be very easy to implement what you are describing and seems like
> a common use case. We just need some decent syntax or a configuration
> setting to indicate which timestamp we are talking about.
>
> Mike
>
> On Apr 13, 2012, at 1:41 AM, Inder Pall wrote:
>
> > Mike,
> >
> > concisely put i want to have YYYY/mm/DD/HH/MM(in the path) to be the
> > time-stamp of agent running the HDFSEventSink.
> > Current code uses timestamp header which is injected by client lib(running
> > on a different box) which doesn't work for me.
> >
> > - inder
> >
> > On Fri, Apr 13, 2012 at 12:45 PM, Mike Percy <[email protected]> wrote:
> >
> >> Hi Inder,
> >> Can you briefly summarize what you want to do and what is missing from
> >> flume for you to do it?
> >>
> >> Seems like you could store in a structure like this with static configs:
> >> flume-data/YYYY/mm/DD/HH/MM/<streamName>.<collectorName>.<filename>.gz
> >>
> >> As you mentioned, you would have to use one HDFS sink per stream/collector
> >> pair to define this statically in the config files.
> >>
> >> Is the problem that you want events to be strictly contained in a log file
> >> named according to their internal timestamp? Is it not acceptable to go by
> >> event delivery time at the agent?
> >>
> >> Best,
> >> Mike
> >>
> >> On Apr 12, 2012, at 12:33 AM, Inder Pall wrote:
> >>
> >>> Mike and Hari,
> >>>
> >>> Appreciate your prompt and detailed responses.
> >>>
> >>> 1. For timestamp header at agent - OOZIE based consumer work flows wait
> >>> for data in a directory structure like -*/flume-data/YYYY/MM/DD/HH/MN/.*.
> >>> we have a *contract* - a minute level directory can be consumed(is
> >>> *immutable*) once the next minute directory is available. If "*timestamp*"
> >>> injected by *clientLib* is used it's difficult to guarantee this contract
> >>> *(messages coming late, clocks not synchronized, etc)*.
> >>>
> >>> Mike, specifically the configuration i was planning for is
> >>> ((N) ClientLib *=>* (N) event-generating agents (Avro Source + AvroSink))
> >>> => (M < N) collector agents (AvroSource + HDFSEventSink)
> >>>
> >>> 2. I agree that the agentPath use-case can be supported without headers
> >>> through a separate HDFSEventSink configuration. This will ensure different
> >>> agent's write to different path's(t*hereby avoiding any critical
> >>> section)*issues.
> >>>
> >>> Reason for asking was to avoid directory structure like -
> >>> */flume-data/<collector1>/YYYY/MM/DD/HH/MN*
> >>> */flume-data/<collector1>/YYYY/MM/DD/HH/MN*
> >>> ......................................
> >>> ......................................
> >>> */flume-data/<collector1>/YYYY/MM/DD/HH/MN*
> >>> and instead have
> >>> */flume-data/YYYY/MM/DD/HH/MN*/<collectorName>-<streamName>-<FileName>.gz
> >>> (*Adding the collectorName avoids multiple folks writing to the same file
> >>> issue*)
> >>>
> >>> However it's tough to obey the above mentioned contract - collector1 has
> >>> moved forward to a new directory and collector2 is still writing to the old
> >>> minute directory. Just wanted to avoid the additional hop of moving data
> >>> from collector specific directories to one unified location, though i can
> >>> live with it.
> >>>
> >>> I don't want to do something specific here and end maintaining a different
> >>> version of FLUME :(
> >>> Let me know what you guys think, i believe as the adoption grows so will
> >>> use-cases which require adding/modifying headers at avroSource.
> >>>
> >>> Looking forward to hearing from you folks
> >>>
> >>> - Inder
> >>>
> >>> On Wed, Apr 11, 2012 at 11:43 PM, Mike Percy <[email protected]> wrote:
> >>>
> >>>> Well, we almost support #1, although the way to do it is pass a
> >>>> "timestamp" header at the first hop. Then you can use the BucketPath
> >>>> shorthand stuff to name the hdfs.path according to this spec (except for
> >>>> the agent-hostname thing).
> >>>>
> >>>> With #2 it seems reasonable to add support for an arbitrary "tag" header
> >>>> or something like that which one could use in the hdfs.path as well. But it
> >>>> would have to come from the first-hop agent at this point. The tag could
> >>>> take the place of the hostname.
> >>>>
> >>>> Something that might get Flume closer to the below vision without hacking
> >>>> the core is adding support for a plugin interface to AvroSource which can
> >>>> annotate headers. However I worry that people might take this and try to do
> >>>> all kinds of parsing and whatnot. So I think the first cut should only
> >>>> support reading & setting headers. This is basically a "routing" feature
> >>>> which I would argue Flume needs to be good at and flexible for.
> >>>>
> >>>> Just in case I misinterpreted the use case, I want to make sure we are not
> >>>> trying to have multiple HDFSEventSink agents append to the same HDFS file
> >>>> simultaneously, since I am pretty sure Hadoop doesn't support that.
> >>>>
> >>>> Inder, just to clarify, is this what you are doing?
> >>>>
> >>>> (N) event-generating agents (Custom Source + AvroSink) => (M < N)
> >>>> collector agents (AvroSource + AvroSink) => Load-Balancing VIP =>
> >>>> (AvroSource + HDFSEventSink) => HDFS
> >>>>
> >>>> Best,
> >>>> Mike
> >>>>
> >>>> On Apr 11, 2012, at 9:55 AM, Hari Shreedharan wrote:
> >>>>
> >>>>> Hi Inder,
> >>>>>
> >>>>> I think these use cases are quite specific to your requirements. Even
> >>>>> though I did not clearly understand (2), I think that can be addressed
> >>>>> through configuration, and you would not need to add any new code for that.
> >>>>> I don't understand why you would want to inject a header in that case. You
> >>>>> can simply have different configurations for each of the agents, with
> >>>>> different sink paths. So agent A would have a sink configured to write to
> >>>>> /flume-data/agenta/.… and so on.
> >>>>>
> >>>>> I don't think we have support for something like (1) as of now. It does
> >>>>> not look like something which is very generic, and have not heard of
> >>>>> someone else having such a requirement. If you want this, the only way I
> >>>>> can see it, is to pick up AvroSource and add this support, and make it
> >>>>> configurable(on/off switch in the conf).
> >>>>>
> >>>>> Thanks
> >>>>> Hari
> >>>>>
> >>>>> --
> >>>>> Hari Shreedharan
> >>>>>
> >>>>> On Wednesday, April 11, 2012 at 4:26 AM, Inder Pall wrote:
> >>>>>
> >>>>>> Folks,
> >>>>>>
> >>>>>> i have two use-cases and both seem to be landing under this requirement
> >>>>>>
> >>>>>> 1. Support to publish files in HDFS in /flume-data/YYYY/MM/DD/HH/MN.
> >>>>>> Timestamp is the arrival time on this agent.
> >>>>>>     Can be addressed by passing timestamp" in HEADERS of event. Caveat is i
> >>>>>>     want to pass this header at the final agent in pipeline.
> >>>>>> 2. Have multiple flume agents configured behind a VIP writing to the same
> >>>>>> HDFS sink path.
> >>>>>>     One of the way's is to have the path like -
> >>>>>>     /flume-data/<flume-agent-hostname>/YYYY/MM/DD/HH/MN
> >>>>>> Again can be addressed by passing a header "hostname" at flume agent and
> >>>>>> configuring the sink path appropriately.
> >>>>>>
> >>>>>> Would appreciate any help on how to address this in a generic way in FLUME.
> >>>>>> Seems to be a generic use-case for anyone planning to take FLUME to
> >>>>>> production.
> >>>>>>
> >>>>>> --
> >>>>>> Thanks,
> >>>>>> - Inder
> >>>>>> Tech Platforms @Inmobi
> >>>>>> Linkedin - http://goo.gl/eR4Ub
> >>>
> >>> --
> >>> Thanks,
> >>> - Inder
> >>> Tech Platforms @Inmobi
> >>> Linkedin - http://goo.gl/eR4Ub
> >
> > --
> > Thanks,
> > - Inder
> > Tech Platforms @Inmobi
> > Linkedin - http://goo.gl/eR4Ub

--
Thanks,
- Inder
Tech Platforms @Inmobi
Linkedin - http://goo.gl/eR4Ub
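
For illustration only, the behaviour discussed in this thread (re-stamping events with the arrival time at the collector, then bucketing by minute with the collector name in the file name) might look roughly like the sketch below. It is plain Python, not Flume code: `stamp_on_arrival`, `bucket_path`, the stream name "clicks", the file name "f1", and the choice of UTC are all hypothetical, and the "timestamp" header is assumed to hold epoch milliseconds as a string.

```python
from datetime import datetime, timezone


def stamp_on_arrival(headers, override, now_ms):
    """Return a copy of headers; set 'timestamp' to the arrival time
    when override is on or when no timestamp header is present."""
    stamped = dict(headers)
    if override or "timestamp" not in stamped:
        stamped["timestamp"] = str(now_ms)
    return stamped


def bucket_path(headers, collector, stream, filename, root="/flume-data"):
    """Build root/YYYY/MM/DD/HH/MN/<collector>-<stream>-<filename>.gz
    from the 'timestamp' header (assumed epoch millis, rendered in UTC)."""
    ts = datetime.fromtimestamp(int(headers["timestamp"]) / 1000, tz=timezone.utc)
    return f"{root}/{ts:%Y/%m/%d/%H/%M}/{collector}-{stream}-{filename}.gz"


# Hypothetical event: stamped by the client at 08:58:30, arriving at 09:01:05.
client_ms = int(datetime(2012, 4, 13, 8, 58, 30, tzinfo=timezone.utc).timestamp() * 1000)
arrival_ms = int(datetime(2012, 4, 13, 9, 1, 5, tzinfo=timezone.utc).timestamp() * 1000)
event_headers = {"timestamp": str(client_ms)}

# Without the override the late event lands in the old 08/58 directory;
# with it, the event lands in the collector's current 09/01 directory.
print(bucket_path(stamp_on_arrival(event_headers, False, arrival_ms), "collector1", "clicks", "f1"))
print(bucket_path(stamp_on_arrival(event_headers, True, arrival_ms), "collector1", "clicks", "f1"))
```

Note that the override only removes the dependence on client clocks and late messages. Two collectors can still be in different minute directories at the same instant, so the "previous minute is immutable" contract would still need collector clocks that are reasonably in sync.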
