Re: New Features Proposed for Apache Flume

Mike Percy Wed, 28 Aug 2013 11:27:58 -0700

> Re: FileSource
> 
> True file tailing would be great.

A pure Java implementation is not possible... It needs to be aware of inodes to 
be reliable.
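
To illustrate the inode point: a tailer that tracks a file only by name cannot tell a rotated log from the original when the new file has the same size and content. Below is a minimal sketch in Python (chosen for brevity, not Flume code), assuming a POSIX filesystem where os.stat reports a stable st_ino. The helper names file_identity and was_rotated are hypothetical.

```python
import os
import tempfile


def file_identity(path):
    """Return (device, inode), which identifies the file itself rather than its name."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)


def was_rotated(path, known_identity):
    """True if the name now points at a different file than the one being tailed."""
    try:
        return file_identity(path) != known_identity
    except FileNotFoundError:
        # The name is missing for the moment, e.g. in the middle of a rotation.
        return True


# Demonstration: rotate a log by renaming it and recreating the original name.
with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "app.log")
    with open(log, "w") as f:
        f.write("first line\n")
    identity = file_identity(log)

    os.rename(log, log + ".1")      # the old file keeps its inode under a new name
    with open(log, "w") as f:       # a fresh file appears under the old name
        f.write("first line\n")     # same size and content as before

    rotated = was_rotated(log, identity)
    old_file_still_tracked = file_identity(log + ".1") == identity
```

Comparing size or content alone would miss this rotation, since both files hold the same bytes; the identity check does not.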


Mike


> 
> Merging multiple lines into one event can already be done with the 
> MorphlineInterceptor with the readMultiLine command [4]. Or maybe embed a 
> morphline directly into that new FileSource?
> 
> Re: GeoIPInterceptor
> 
> Seems to me that it would be more flexible, powerful and reusable to add this 
> kind of functionality as a morphline command - contributions welcome!
> 
> Finally, a word of caution, Maxmind is a good geo db, and I've used it 
> before, but it has some LGPL issues that may or may not be workable in this 
> context. Maxmind db fits into RAM - Lucene seems like overkill here - you can 
> do fast maxmind lookups directly without Lucene.
> 
> [1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
> [2] 
> http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#grok
> [3] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> [4] 
> http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#readMultiLine
> 
> Wolfgang.
> 
>> 
>> *FileSource*
>> 
>> Using the Tailer feature from Apache Commons I/O utility [1], we can tail
>> specific files for events.
>> 
>> This allows us to, regardless of the operating system, have the ability to
>> watch files for future events as they occur.
>> 
>> It also allows us to step in and determine if two or more events should be
>> merged into one event if newline characters are present in an event.
>> 
>> We can configure certain regular expressions that determine if a specific
>> line is a new event or part of the previous event.
>> 
>> Essentially, this source will have the ability to merge multiple lines into
>> one event before it is passed on to interceptors.
>> 
>> It has been complicated to group multiple lines into a single event with the
>> Spooling Directory Source or Exec Source. I tried creating custom
>> deserializers but it was hard to get around the logic used to parse the
>> files.
>> 
>> Using the Spooling Directory also means we cannot watch the original files
>> so we need a background process to copy over the log files into the
>> spooling directory which requires additional setup.
>> 
>> The tail command is also not available on all operating systems out of the
>> box.
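
The regex-driven merging described above can be sketched as follows. This is an illustrative Python sketch, not the proposed FileSource; the function name merge_lines, the timestamp pattern, and the sample lines are assumptions made for the example.

```python
import re


def merge_lines(lines, new_event_pattern):
    """Group raw lines into events.

    A line matching new_event_pattern starts a new event; any other line is
    treated as a continuation of the event in progress.
    """
    starts_event = re.compile(new_event_pattern)
    events = []
    current = []
    for line in lines:
        if starts_event.match(line) and current:
            events.append("\n".join(current))
            current = []
        current.append(line)
    if current:
        events.append("\n".join(current))
    return events


# Hypothetical input: a log record followed by a stack trace, then a new record.
lines = [
    "2013-08-28 11:27:58 ERROR Request failed",
    "java.lang.IllegalStateException: boom",
    "    at com.example.Handler.run(Handler.java:42)",
    "2013-08-28 11:27:59 INFO Recovered",
]
events = merge_lines(lines, r"\d{4}-\d{2}-\d{2} ")
```

Here the stack trace lines do not start with a timestamp, so they stay attached to the ERROR record and the four input lines become two events.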
>> 
>> 
>> *GrokInterceptor*
>> 
>> With this interceptor we can parse semi-structured and unstructured text and
>> log data in the headers and body of the event into something structured
>> that can be easily queried.
>> I plan to use the information in [2] and [3] for this.
>> With this interceptor, we can extract HTTP response codes, response times,
>> user agents, IP addresses and a whole bunch of useful data points from free
>> form text.
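
As a rough illustration of the kind of extraction described above, the sketch below uses a plain Python regular expression with named groups in place of a real grok pattern library. The pattern, the field names, and the sample access-log line are assumptions for the example, not part of the proposal.

```python
import re

# Hand-written pattern for a combined-format access log line (an assumption;
# a grok library would assemble something similar from reusable named patterns).
LOG_PATTERN = re.compile(
    r'(?P<client_ip>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) \S+" '
    r'(?P<status>\d{3}) (?P<bytes>\d+|-) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)


def extract_fields(body):
    """Return the named fields found in body, or an empty dict if it does not match."""
    match = LOG_PATTERN.match(body)
    return match.groupdict() if match else {}


# Illustrative line using a documentation-only IP address.
line = (
    '192.0.2.10 - - [28/Aug/2013:11:27:58 -0700] '
    '"GET /index.html HTTP/1.1" 200 2326 "-" "Mozilla/5.0"'
)
fields = extract_fields(line)
```

The resulting dictionary could then be copied into event headers, which is where downstream components would query it.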
>> 
>> 
>> 
>> *GeoIPInterceptor*
>> 
>> This is for IP intelligence.
>> 
>> This interceptor will allow us to use the value of an IP address in the
>> event header or body of the request to estimate the geographical location
>> of the IP address.
>> 
>> Using the database available here [4], we can inject the two-letter code or
>> country name of the IP address into the event.
>> 
>> We can also deduce other values such as city name, postalCode, latitude,
>> longitude, Internet Service Provider and Organization name.
>> 
>> This can be very helpful in analyzing traffic patterns and target audience
>> from webserver or application logs.
>> 
>> The database is loaded into a Lucene index when the agent is started up.
>> The index is only created once if it does not already exist.
>> 
>> As the interceptor comes across events, it maps the IP address to a variety
>> of values that can be injected into the events.
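
The IP-to-location mapping can also be done entirely in memory, as suggested elsewhere in this thread. The sketch below assumes the database has been reduced to sorted, non-overlapping integer ranges; the rows are hypothetical placeholders built from documentation-only address blocks and user-assigned country codes, not real MaxMind data, and lookup_country is a made-up name.

```python
import bisect
import ipaddress


def ip_to_int(ip):
    """Convert a dotted IPv4 string to its integer value."""
    return int(ipaddress.ip_address(ip))


# Hypothetical placeholder rows: (range_start, range_end, country_code).
RANGES = sorted([
    (ip_to_int("192.0.2.0"), ip_to_int("192.0.2.255"), "AA"),
    (ip_to_int("198.51.100.0"), ip_to_int("198.51.100.255"), "ZZ"),
])
STARTS = [row[0] for row in RANGES]


def lookup_country(ip):
    """Binary-search the sorted ranges; return a country code or None."""
    n = ip_to_int(ip)
    i = bisect.bisect_right(STARTS, n) - 1   # last range starting at or before n
    if i >= 0 and RANGES[i][0] <= n <= RANGES[i][1]:
        return RANGES[i][2]
    return None
```

Each lookup is a binary search over a list held in RAM, so no separate index needs to be built or kept on disk for this simple case.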
>> 
>> 
>> 
>> *RedisSink*
>> 
>> This can provide another option for setting up a fan-in and/or fan-out
>> architecture.
>> 
>> The RedisSink can serve as a queue that is used as a source by another
>> agent down the line.
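
The queue hand-off between agents might look like the sketch below. It uses an in-memory stand-in rather than a real Redis client, so the class InMemoryList is hypothetical; it only mimics the semantics of the Redis LPUSH and RPOP commands on a single list.

```python
from collections import deque


class InMemoryList:
    """Stand-in for one Redis list, mimicking LPUSH and RPOP semantics."""

    def __init__(self):
        self._items = deque()

    def lpush(self, value):
        """Push onto the head of the list and return the new length."""
        self._items.appendleft(value)
        return len(self._items)

    def rpop(self):
        """Pop from the tail of the list, or return None when it is empty."""
        return self._items.pop() if self._items else None


queue = InMemoryList()

# Upstream agent: the sink pushes each event body onto the list.
for body in ("event-1", "event-2", "event-3"):
    queue.lpush(body)

# Downstream agent: a source drains the list from the other end.
received = []
while True:
    item = queue.rpop()
    if item is None:
        break
    received.append(item)
```

Pushing on one end and popping from the other preserves arrival order, which is what lets the list act as a queue between the two agents.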
>> 
>> *References*
>> [1]
>> http://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/Tailer.html
>> [2] https://github.com/NFLabs/java-grok
>> [3] http://www.anthonycorbacho.net/portfolio/grok-pattern/
>> [4] http://dev.maxmind.com/geoip/legacy/geolite/#Downloads
>> [5] http://dev.maxmind.com/geoip/legacy/csv/
>> [6] http://redis.io/documentation
>> [7] https://github.com/xetorthio/jedis
>> 
>> *Author and Instructor for the Upcoming Book and Lecture Series*
>> *Massive Log Data Aggregation, Processing, Searching and Visualization with
>> Open Source Software*
>> *http://massivelogdata.com*
>
