I apologize, this was intended for the Flume mailing list. Sorry about that!

*Devin Suiter*
Jr. Data Solutions Software Engineer
100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
Google Voice: 412-256-8556 | www.rdx.com

On Wed, Oct 30, 2013 at 11:04 AM, DSuiter RDX <[email protected]> wrote:

> Hi, just a general behavioral question.
>
> We have a syslogTCP source catching remotely generated syslog events. They
> go to an Avro sink, which delivers them to an Avro source, then into an
> HDFS sink.
>
> I currently have a test replicating channel delivering the events to HDFS
> with the avro_event serializer, and also delivering the same events to HDFS
> without the avro_event serializer. The latter results in a text-encoded
> aggregate file, which works well.
>
> The issue I would like clarification on is this:
>
> When an event is saved to HDFS as Avro, an epoch timestamp, the hostname,
> and some severity and facility information are saved along with the
> message body. The Avro schema has a "headers" section and a "body" section;
> the timestamp, etc., are in "headers," and the actual text is the "body."
>
> However, when the file is saved to HDFS as text, the only thing we get is
> the content of the "body" field, and there is no longer any host,
> timestamp, etc., even though those are components of the original message.
>
> Where are the components from the generating server being stripped away?
> By the syslogTCP source, or by the HDFS sink serializing into text?
>
> Another way to summarize this: when the server writes the events to
> syslog, it writes them with timestamp and host fields. If we use Avro the
> whole way, that information is kept as headers, but if we save as text, no
> timestamp or host information is preserved. We would like it preserved so
> we can programmatically parse the timestamp to sort by day. We would also
> like to not have to deal with Avro MapReduce for the time being, as that
> has proved challenging.
>
> So, is there a way that I can get the WHOLE event as the "body" using the
> syslogTCP source, or do we need to look at an exec source to tail the
> generating server's /var/log/messages and send it that way?
>
> Thanks,
> *Devin Suiter*
> Jr. Data Solutions Software Engineer
> 100 Sandusky Street | 2nd Floor | Pittsburgh, PA 15212
> Google Voice: 412-256-8556 | www.rdx.com
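
To make the goal in the quoted message concrete, here is a rough Python sketch of the "parse the timestamp to sort by day" step, as it might be run over the text files if the whole syslog line were preserved. It is only an illustration under assumptions: the lines are assumed to start with a classic BSD-style syslog timestamp ("Mmm dd hh:mm:ss"), which carries no year, so the caller supplies one; the month abbreviations are assumed to be English; and `bucket_by_day` is a hypothetical helper name, not part of Flume.

```python
from collections import defaultdict
from datetime import datetime


def bucket_by_day(lines, year):
    """Group raw syslog lines by calendar day (hypothetical helper).

    Assumes each line begins with a BSD-style syslog timestamp such as
    "Oct 30 11:04:00". That format has no year, so the caller passes one in.
    Lines whose prefix cannot be parsed are collected under "unparsed".
    """
    buckets = defaultdict(list)
    for line in lines:
        stamp = line[:15]  # "Mmm dd hh:mm:ss" is 15 characters wide
        try:
            ts = datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")
        except ValueError:
            buckets["unparsed"].append(line)
            continue
        buckets[ts.date().isoformat()].append(line)
    return dict(buckets)


if __name__ == "__main__":
    sample = [
        "Oct 30 11:04:00 host1 sshd[123]: session opened",
        "Oct  3 09:00:00 host2 cron[9]: job started",
        "a line with no timestamp at all",
    ]
    for day, entries in sorted(bucket_by_day(sample, 2013).items()):
        print(day, len(entries))
```

None of this works on the text output described above, since only the "body" reaches the file there; the sketch just shows why keeping the timestamp in the line matters.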
