Re: New Features Proposed for Apache Flume

Wolfgang Hoschek Wed, 28 Aug 2013 10:24:36 -0700

Re: GrokInterceptor

This functionality is already available in the form of the Apache Flume 
MorphlineInterceptor [1] with the grok command [2]. While grok is very useful, 
consider that grok alone often isn't enough - you typically need some other log 
event processing commands as well, for example as contained in morphlines [3].
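
To illustrate the point, here is a minimal Python sketch of grok-style
extraction followed by one extra processing step. It is not the morphline
grok command or its configuration syntax; the log layout, the field names
(client_ip, method, path, status, millis) and the helper extract_fields are
assumptions made up for this example.

```python
import re

# Hypothetical grok-like pattern: the named groups play the role of grok
# fields. The log layout matched here is an assumption for this sketch.
ACCESS_LINE = re.compile(
    r'(?P<client_ip>\d{1,3}(?:\.\d{1,3}){3}) '
    r'"(?P<method>[A-Z]+) (?P<path>\S+)" '
    r'(?P<status>\d{3}) (?P<millis>\d+)'
)


def extract_fields(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = ACCESS_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Follow-up processing that pattern matching alone does not provide:
    # convert the numeric fields from strings to integers.
    fields["status"] = int(fields["status"])
    fields["millis"] = int(fields["millis"])
    return fields


record = extract_fields('192.0.2.10 "GET /index.html" 200 35')
print(record)
```

The type conversion at the end stands in for the "other commands" case: the
pattern match yields only strings, and anything beyond that (conversion,
filtering, enrichment) needs further steps.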


Re: FileSource

True file tailing would be great. 

Merging multiple lines into one event can already be done with the 
MorphlineInterceptor with the readMultiLine command [4]. Or maybe embed a 
morphline directly into that new FileSource?
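
The general idea of regex-driven multi-line merging can be sketched in a few
lines of Python. This is not how readMultiLine is implemented or configured;
the start-of-event pattern (a leading ISO-style date) and the helper
merge_lines are assumptions chosen for illustration.

```python
import re

# Assumption: a line that begins with an ISO-style date starts a new event;
# any other line is a continuation of the previous event.
NEW_EVENT = re.compile(r"^\d{4}-\d{2}-\d{2} ")


def merge_lines(lines, start_pattern=NEW_EVENT):
    """Yield events, appending continuation lines to the previous event."""
    current = None
    for line in lines:
        if current is None or start_pattern.match(line):
            # Emit the finished event (if any) and start a new one.
            if current is not None:
                yield current
            current = line
        else:
            current += "\n" + line
    if current is not None:
        yield current


log = [
    "2013-08-28 10:24:36 ERROR request failed",
    "java.lang.IllegalStateException: boom",
    "    at com.example.Handler.run(Handler.java:42)",
    "2013-08-28 10:24:37 INFO recovered",
]
events = list(merge_lines(log))
print(len(events))
```

Here the stack trace lines are folded into the ERROR event, so the four input
lines become two events.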

Re: GeoIPInterceptor

Seems to me that it would be more flexible, powerful and reusable to add this 
kind of functionality as a morphline command - contributions welcome!

Finally, a word of caution: MaxMind is a good geo database, and I've used it
before, but it has some LGPL issues that may or may not be workable in this
context. Also, the MaxMind database fits into RAM, so Lucene seems like
overkill here. You can do fast MaxMind lookups directly without Lucene.
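
As a rough sketch of why an in-memory lookup needs no search index: keep the
IP ranges sorted and binary-search them. This Python example does not use the
MaxMind API or file format. The ranges and the country codes AA, BB and CC
are made-up placeholders, and the ranges are assumed not to overlap.

```python
import bisect
import ipaddress

# Made-up sample data (documentation address ranges), not real GeoIP records.
RANGES = [
    ("192.0.2.0", "192.0.2.255", "AA"),
    ("198.51.100.0", "198.51.100.255", "BB"),
    ("203.0.113.0", "203.0.113.255", "CC"),
]


def build_index(ranges):
    """Convert (first_ip, last_ip, code) rows to sorted integer ranges."""
    rows = sorted(
        (int(ipaddress.ip_address(lo)), int(ipaddress.ip_address(hi)), code)
        for lo, hi, code in ranges
    )
    starts = [row[0] for row in rows]
    return starts, rows


def lookup(index, ip):
    """Return the code of the range containing ip, or None if none does."""
    starts, rows = index
    value = int(ipaddress.ip_address(ip))
    # Find the last range whose start is <= value.
    pos = bisect.bisect_right(starts, value) - 1
    if pos < 0:
        return None
    _, end, code = rows[pos]
    return code if value <= end else None


INDEX = build_index(RANGES)
print(lookup(INDEX, "198.51.100.7"))
```

Each lookup is a single binary search over a list held in memory, which is
the kind of direct access meant above.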

[1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
[2] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#grok
[3] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
[4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#readMultiLine

Wolfgang.

> 
> *FileSource*
> 
> Using the Tailer feature from Apache Commons I/O utility [1], we can tail
> specific files for events.
> 
> This allows us to, regardless of the operating system, have the ability to
> watch files for future events as they occur.
> 
> It also allows us to step in and determine if two or more events should be
> merged into one event if newline characters are present in an event.
> 
> We can configure certain regular expressions that determine if a specific
> line is a new event or part of the previous event.
> 
> Essentially, this source will have the ability to merge multiple lines into
> one event before it is passed on to interceptors.
> 
> It has been complicated to group multiple lines into a single event with the
> Spooling Directory Source or Exec Source. I tried creating custom
> deserializers but it was hard to get around the logic used to parse the
> files.
> 
> Using the Spooling Directory also means we cannot watch the original files
> so we need a background process to copy over the log files into the
> spooling directory which requires additional setup.
> 
> The tail command is also not available on all operating systems out of the
> box.
> 
> 
> *GrokInterceptor*
> 
> With this interceptor we can parse semi-structured and unstructured text and
> log data in the headers and body of the event into something structured
> that can be easily queried.
> I plan to use the information in [2] and [3] for this.
> With this interceptor, we can extract HTTP response codes, response times,
> user agents, IP addresses and a whole bunch of useful data points from
> free-form text.
> 
> 
> 
> *GeoIPInterceptor*
> 
> This is for IP intelligence.
> 
> This interceptor will allow us to use the value of an IP address in the
> event header or body of the request to estimate the geographical location
> of the IP address.
> 
> Using the database available here [4], we can inject the two-letter code or
> country name of the IP address into the event.
> 
> We can also deduce other values such as city name, postalCode, latitude,
> longitude, Internet Service Provider and Organization name.
> 
> This can be very helpful in analyzing traffic patterns and target audience
> from webserver or application logs.
> 
> The database is loaded into a Lucene index when the agent is started up.
> The index is only created once if it does not already exist.
> 
> As the interceptor comes across events, it maps the IP address to a variety
> of values that can be injected into the events.
> 
> 
> 
> *RedisSink*
> 
> This can provide another option for setting up a fan-in and/or fan-out
> architecture.
> 
> The RedisSink can serve as a queue that is used as a source by another
> agent down the line.
> 
> *References*
> [1] http://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/Tailer.html
> [2] https://github.com/NFLabs/java-grok
> [3] http://www.anthonycorbacho.net/portfolio/grok-pattern/
> [4] http://dev.maxmind.com/geoip/legacy/geolite/#Downloads
> [5] http://dev.maxmind.com/geoip/legacy/csv/
> [6] http://redis.io/documentation
> [7] https://github.com/xetorthio/jedis
> 
> *Author and Instructor for the Upcoming Book and Lecture Series*
> *Massive Log Data Aggregation, Processing, Searching and Visualization with
> Open Source Software*
> *http://massivelogdata.com*
