Inline below.

On Mon, Sep 22, 2014 at 7:25 PM, Hanish Bansal <[email protected]> wrote:
> Thanks for the reply!
>
> According to the architecture, multiple agents will run (one agent on
> each node) and all agents will send log data (as events) to a common
> collector, and that collector will write the data to HDFS.
>
> Here I have some questions:
>
> How will the collector receive the data from the agents and write it
> to HDFS?

Agents are typically connected by having the AvroSink of one agent
connect to the AvroSource of the next agent in the chain. (A minimal
config sketch of that topology is below.)

> Is data coming from different agents mixed up with each other, since
> the collector will write the data to HDFS in the order it arrives at
> the collector?

Data will enter the collector's channel in the order it arrives. Flume
doesn't do any re-ordering of data.

> Can we use this logging feature for real-time logging? We can accept
> this if all logs end up in HDFS in the same order they were generated
> by the application.

It depends on your requirements. If you need events to be sorted by
timestamp, then writing directly out of Flume to HDFS is not sufficient.
If you want your data in HDFS, then the best bet is to write to a
partitioned Dataset and run a batch job to sort partitions. For example,
if you partitioned by hour, you'd have an hourly job to sort the last
hour's worth of data. (A rough sketch of such a job is below.)
Alternatively, you could write the data to HBase, which can sort the
data by timestamp during ingest. In either case, Kite would make it
easier for you to integrate with the underlying store, as it can write
to both HDFS and HBase.
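To make the first answer concrete, here's a minimal, untested sketch of
that two-tier topology. Hostnames, ports, and the HDFS path are
placeholders, and I'm using memory channels just to keep the sketch
short:

    # agent.properties -- runs on every node; the app's Log4jAppender
    # connects to the AvroSource, and the AvroSink forwards events to
    # the collector tier.
    agent.sources = r1
    agent.channels = c1
    agent.sinks = k1

    agent.sources.r1.type = avro
    agent.sources.r1.bind = 0.0.0.0
    agent.sources.r1.port = 41414
    agent.sources.r1.channels = c1

    agent.channels.c1.type = memory

    agent.sinks.k1.type = avro
    agent.sinks.k1.hostname = collector.example.com
    agent.sinks.k1.port = 4545
    agent.sinks.k1.channel = c1

    # collector.properties -- receives from all agents and writes to
    # HDFS, partitioned by time.
    collector.sources = r1
    collector.channels = c1
    collector.sinks = k1

    collector.sources.r1.type = avro
    collector.sources.r1.bind = 0.0.0.0
    collector.sources.r1.port = 4545
    collector.sources.r1.channels = c1

    collector.channels.c1.type = memory

    collector.sinks.k1.type = hdfs
    collector.sinks.k1.hdfs.path = hdfs://namenode/logs/%Y/%m/%d/%H
    # The %Y/%m/%d/%H escapes need a timestamp; using the collector's
    # local clock is the simplest option for a sketch like this.
    collector.sinks.k1.hdfs.useLocalTimeStamp = true
    collector.sinks.k1.channel = c1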
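The matching log4j.properties on each node would look something like
this (see the Log4jAppender discussion in the quoted message below).
The optional AvroSchemaUrl line is the schema-URL variant mentioned
there; the URL itself is a placeholder:

    log4j.rootLogger = INFO, flume

    log4j.appender.flume = org.apache.flume.clients.log4jappender.Log4jAppender
    # Point at the AvroSource of the node-local agent.
    log4j.appender.flume.Hostname = localhost
    log4j.appender.flume.Port = 41414
    # Don't fail the application if the agent is down.
    log4j.appender.flume.UnsafeMode = true
    # Optional: host the schema at a URL instead of sending a JSON
    # representation of it in every event's headers.
    # log4j.appender.flume.AvroSchemaUrl = hdfs://namenode/schemas/log.avsc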
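And here's a very rough sketch of the hourly sort job, using Crunch. It
assumes each log line begins with a sortable timestamp (e.g. ISO-8601),
so a plain lexicographic sort is also a time sort; the class name and
paths are made up:

    import org.apache.crunch.PCollection;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.lib.Sort;

    public class SortHourlyPartition {
      public static void main(String[] args) {
        // args[0]: the partition that just closed,
        //          e.g. /logs/raw/2014/09/22/19
        // args[1]: where to write the sorted copy,
        //          e.g. /logs/sorted/2014/09/22/19
        Pipeline pipeline = new MRPipeline(SortHourlyPartition.class);

        // Read the closed hourly partition as lines of text.
        PCollection<String> lines = pipeline.readTextFile(args[0]);

        // Total sort; lexicographic order equals time order as long
        // as every line starts with an ISO-8601 timestamp.
        PCollection<String> sorted = Sort.sort(lines);

        pipeline.writeTextFile(sorted, args[1]);
        pipeline.done();
      }
    }

You'd kick that off from cron or Oozie once each hour's partition is
closed.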
-Joey

> Regards
> Hanish
>
> On 22/09/2014 10:12 pm, "Joey Echeverria" <[email protected]> wrote:
>
>> Hi Hanish,
>>
>> The Log4jAppender is designed to connect to a Flume agent running an
>> AvroSource. So, you'd configure Flume similar to [1] and then point
>> the Log4jAppender at your agent using the log4j properties you linked
>> to.
>>
>> The Log4jAppender will use Avro to inspect the object being logged to
>> determine its schema and to serialize it to bytes, which become the
>> body of the events sent to Flume. If you're logging Strings, which is
>> most common, then the schema will just be a Schema.String. There are
>> two ways that schema information can be passed. You can configure the
>> Log4jAppender with a schema URL that will be sent in the event
>> headers, or you can leave that out and a JSON representation of the
>> schema will be sent as a header with each event. The URL is more
>> efficient as it avoids sending extra information with each record,
>> but you can leave it out to start your testing.
>>
>> With regards to your second question, the answer is no. Flume does
>> not attempt to re-order events, so your logs will appear in arrival
>> order. What I would do is write the data to a partitioned directory
>> structure and then have a Crunch job that sorts each partition as it
>> closes.
>>
>> You might consider taking a look at the Kite SDK[2] as we have some
>> examples that show how to do the logging[3] and can also handle
>> getting the data properly partitioned on HDFS.
>>
>> HTH,
>>
>> -Joey
>>
>> [1] http://flume.apache.org/FlumeUserGuide.html#avro-source
>> [2] http://kitesdk.org/docs/current/
>> [3] https://github.com/kite-sdk/kite-examples/tree/snapshot/logging
>>
>> On Mon, Sep 22, 2014 at 4:21 AM, Hanish Bansal
>> <[email protected]> wrote:
>> > Hi All,
>> >
>> > I want to use the Flume log4j-appender for logging of a map-reduce
>> > application which is running on different nodes. My use case is to
>> > have the logs from all nodes in a centralized location (say HDFS)
>> > with time-based synchronization.
>> >
>> > As described in the links below, Flume has its own appender which
>> > can be used to log an application to HDFS directly from log4j.
>> >
>> > http://www.addsimplicity.com/adding_simplicity_an_engi/2010/10/sending-logs-down-the-flume.html
>> >
>> > http://flume.apache.org/FlumeUserGuide.html#log4j-appender
>> >
>> > Could anyone please tell me if these logs are synchronized on a
>> > time basis or not?
>> >
>> > What I mean by time-based synchronization is: having logs from
>> > different nodes in sorted order of time.
>> >
>> > Also, could anyone provide a link that describes how the Flume
>> > log4j-appender works internally?
>> >
>> > --
>> > Thanks & Regards
>> > Hanish Bansal
>>
>> --
>> Joey Echeverria

--
Joey Echeverria
