Re: New Features Proposed for Apache Flume

Wolfgang Hoschek Sat, 16 Nov 2013 02:27:47 -0800

FYI, I've just added a new morphline command that returns Geolocation 
information for a given IP address, using an efficient in-memory Maxmind 
database lookup - https://issues.cloudera.org/browse/CDK-227


This can then be used in the MorphlineInterceptor or Morphline Sink.

Wolfgang.

On Sep 7, 2013, at 8:47 PM, Israel Ekpo wrote:

> Thank you everyone for your very constructive feedback. It was very
> helpful.
> 
> To provide some background, most of these suggestions have been inspired by
> features I have found in Logstash [3].
> 
> I am going to spend more time to understand how the cdk morphline commands
> [4] work because I think it will really help with the transformation utils
> needed in FileSource.
> 
> Regarding the GrokInterceptor, I was not aware of the existence of
> MorphlineInterceptor. It already does what I was proposing with
> GrokInterceptor. So we are cool from that end.
> 
> In simple standalone tests, the commons-io class that I am planning to use
> for the FileSource handles file rotations well but I have not tested
> renames or removals yet.
> 
> Regarding the GeoIPInterceptor, we can provide links for downloading the
> Maxmind database separately without bundling the IP database with Flume
> releases.
> 
> This is how the Logstash project does it.
> 
> Because of the large number of events expected, I was planning to use
> Lucene for the speed of its range queries over trie-indexed numeric fields
> [5]. The results can also be cached in memory if the same query has been
> executed before.
> 
> I can perform some benchmarks with and without Lucene and see if the
> performance differences justify using it for the lookups.
> 
> My gut feeling is that using Lucene will lead to shorter processing times
> as the volume of events increases.
> 
> The RedisSource and RedisSink features will just be simple sources and
> sinks. The sink will push [1] events to the Redis server and the source
> will do a blocking pop [2] as it waits for new events to occur on the Redis
> Server.
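
For illustration, the push and blocking-pop flow described above can be sketched roughly as follows. This is a minimal sketch, not the proposed RedisSink or RedisSource: `InMemoryListServer` is a hypothetical in-process stand-in that only mimics the semantics of RPUSH [1] and BLPOP [2] so the example runs without a Redis server, and the key name `flume:events` and the helper names are assumptions. One deliberate difference: here `timeout=None` means wait forever, whereas Redis uses a timeout of 0 for that.

```python
import collections
import threading


class InMemoryListServer:
    """Hypothetical stand-in for a Redis server; mimics only RPUSH and BLPOP."""

    def __init__(self):
        self._lists = collections.defaultdict(collections.deque)
        self._cond = threading.Condition()

    def rpush(self, key, value):
        # Append to the tail of the list and wake any blocked consumers.
        with self._cond:
            self._lists[key].append(value)
            self._cond.notify_all()
            return len(self._lists[key])

    def blpop(self, key, timeout=None):
        # Block until the list is non-empty or the timeout (seconds) expires.
        with self._cond:
            ready = self._cond.wait_for(lambda: self._lists[key], timeout=timeout)
            if not ready:
                return None  # timed out with nothing to pop
            return key, self._lists[key].popleft()


def sink_process(server, key, events):
    """Sink side: push each event onto the shared list."""
    for event in events:
        server.rpush(key, event)


def source_poll(server, key, timeout=0.1):
    """Source side: blocking pop; returns the event body, or None on timeout."""
    item = server.blpop(key, timeout=timeout)
    return None if item is None else item[1]


server = InMemoryListServer()
sink_process(server, "flume:events", ["event-1", "event-2"])
# Two events come back in FIFO order; the third poll times out.
received = [source_poll(server, "flume:events") for _ in range(3)]
```

Pushing at the tail and popping at the head gives FIFO ordering, which is what lets the list act as a queue between two agents.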
> 
> I am still trying out a few things; this part is not yet finalized.
> 
> Regarding contributing features as plugins, how are plugins typically
> contributed and managed?
> 
> Do I have to create a GitHub repo and manage it independently, or are they
> contributed as patches to the Flume project?
> 
> [1] http://redis.io/commands/rpush
> [2] http://redis.io/commands/blpop
> [3] http://logstash.net/docs/1.2.1/
> [4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> [5]
> http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/NumericRangeQuery.html
> 
> *Author and Instructor for the Upcoming Book and Lecture Series*
> *Massive Log Data Aggregation, Processing, Searching and Visualization with
> Open Source Software*
> *http://massivelogdata.com*
> 
> 
> On Wed, Aug 28, 2013 at 1:21 PM, Wolfgang Hoschek 
> <[email protected]>wrote:
> 
>> Re: GrokInterceptor
>> 
>> This functionality is already available in the form of the Apache Flume
>> MorphlineInterceptor [1] with the grok command [2]. While grok is very
>> useful, consider that grok alone often isn't enough - you typically need
>> some other log event processing commands as well, for example as contained
>> in morphlines [3].
>> 
>> Re: FileSource
>> 
>> True file tailing would be great.
>> 
>> Merging multiple lines into one event can already be done with the
>> MorphlineInterceptor with the readMultiLine command [4]. Or maybe embed a
>> morphline directly into that new FileSource?
>> 
>> Re: GeoIPInterceptor
>> 
>> Seems to me that it would be more flexible, powerful and reusable to add
>> this kind of functionality as a morphline command - contributions welcome!
>> 
>> Finally, a word of caution: Maxmind is a good geo db, and I've used it
>> before, but it has some LGPL issues that may or may not be workable in this
>> context. Maxmind db fits into RAM - Lucene seems like overkill here - you
>> can do fast maxmind lookups directly without Lucene.
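
To illustrate why an in-memory lookup can be enough here, below is a minimal sketch of a range lookup using a sorted array and binary search. It does not use Maxmind's own API or file format; the rows are made-up placeholders (documentation and private address ranges with fake country codes), `RangeTable` is a hypothetical name, and the sketch assumes IPv4 ranges that do not overlap.

```python
import bisect
import ipaddress

# Hypothetical sample rows: (start_ip, end_ip, country_code).
# Illustrative only; not taken from a real Maxmind database.
ROWS = [
    ("10.0.0.0", "10.255.255.255", "AA"),
    ("192.0.2.0", "192.0.2.255", "BB"),
    ("198.51.100.0", "198.51.100.127", "CC"),
]


def to_int(ip):
    """Convert a dotted-quad IPv4 string to its integer value."""
    return int(ipaddress.IPv4Address(ip))


class RangeTable:
    """Sorted, non-overlapping IP ranges held entirely in memory."""

    def __init__(self, rows):
        ranges = sorted((to_int(s), to_int(e), code) for s, e, code in rows)
        self.starts = [r[0] for r in ranges]
        self.ends = [r[1] for r in ranges]
        self.codes = [r[2] for r in ranges]

    def lookup(self, ip):
        n = to_int(ip)
        # Find the last range whose start is <= n, then check its end.
        i = bisect.bisect_right(self.starts, n) - 1
        if i >= 0 and n <= self.ends[i]:
            return self.codes[i]
        return None  # address falls in a gap between ranges


table = RangeTable(ROWS)
```

Each lookup costs one binary search, so it stays O(log n) in the number of ranges without any index on disk.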
>> 
>> [1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
>> [2]
>> http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#grok
>> [3] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
>> [4]
>> http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#readMultiLine
>> 
>> Wolfgang.
>> 
>>> 
>>> *FileSource*
>>> 
>>> Using the Tailer feature from Apache Commons I/O utility [1], we can tail
>>> specific files for events.
>>> 
>>> This allows us, regardless of the operating system, to watch files for
>>> future events as they occur.
>>> 
>>> It also allows us to step in and determine if two or more events should be
>>> merged into one event if newline characters are present in an event.
>>> 
>>> We can configure regular expressions that determine whether a specific
>>> line is a new event or part of the previous event.
>>> 
>>> Essentially, this source will have the ability to merge multiple lines into
>>> one event before it is passed on to interceptors.
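
As an illustration of the regex-driven merging described above, here is a minimal sketch. It is not the proposed FileSource: the `NEW_EVENT` pattern (a line starting with a date marks a new event), the function name `merge_lines`, and the sample log lines are all assumptions chosen for the example.

```python
import re

# Assumption: a line that starts with a YYYY-MM-DD date begins a new event;
# any other line is a continuation of the previous event.
NEW_EVENT = re.compile(r"^\d{4}-\d{2}-\d{2} ")


def merge_lines(lines, new_event=NEW_EVENT):
    """Group raw lines into events, joining continuation lines with newlines."""
    current = []
    for line in lines:
        line = line.rstrip("\n")
        if new_event.match(line) and current:
            # A new event starts here, so emit the one collected so far.
            yield "\n".join(current)
            current = []
        current.append(line)
    if current:
        yield "\n".join(current)  # flush the last event


sample = [
    "2013-08-28 13:21:00 ERROR request failed",
    "java.lang.IllegalStateException: boom",
    "    at com.example.Handler.handle(Handler.java:42)",
    "2013-08-28 13:21:01 INFO recovered",
]
events = list(merge_lines(sample))
```

With the sample above, the stack trace lines are folded into the first event, which yields two events in total. A tailing source would also need a timeout to flush the last event, since it cannot know whether more continuation lines are still coming.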
>>> 
>>> It has been complicated to group multiple lines into a single event with the
>>> Spooling Directory Source or Exec Source. I tried creating custom
>>> deserializers but it was hard to get around the logic used to parse the
>>> files.
>>> 
>>> Using the Spooling Directory also means we cannot watch the original files,
>>> so we need a background process to copy the log files into the
>>> spooling directory, which requires additional setup.
>>> 
>>> The tail command is also not available on all operating systems out of the
>>> box.
>>> 
>>> 
>>> *GrokInterceptor*
>>> 
>>> With this interceptor we can parse semi-structured and unstructured text and
>>> log data in the headers and body of the event into something structured
>>> that can be easily queried.
>>> I plan to use the information in [2] and [3] for this.
>>> With this interceptor, we can extract HTTP response codes, response times,
>>> user agents, IP addresses and a whole bunch of useful data points from
>>> free-form text.
>>> 
>>> 
>>> 
>>> *GeoIPInterceptor*
>>> 
>>> This is for IP intelligence.
>>> 
>>> This interceptor will allow us to use the value of an IP address in the
>>> event header or body of the request to estimate the geographical location
>>> of the IP address.
>>> 
>>> Using the database available here [4], we can inject the two-letter country
>>> code or country name for the IP address into the event.
>>> 
>>> We can also deduce other values such as city name, postalCode, latitude,
>>> longitude, Internet Service Provider and Organization name.
>>> 
>>> This can be very helpful in analyzing traffic patterns and target audience
>>> from webserver or application logs.
>>> 
>>> The database is loaded into a Lucene index when the agent is started up.
>>> The index is only created once, if it does not already exist.
>>> 
>>> As the interceptor comes across events, it maps the IP address to a variety
>>> of values that can be injected into the events.
>>> 
>>> 
>>> 
>>> *RedisSink*
>>> 
>>> This can provide another option for setting up a fan-in and/or fan-out
>>> architecture.
>>> 
>>> The RedisSink can serve as a queue that is used as a source by another
>>> agent down the line.
>>> 
>>> *References*
>>> [1]
>>> http://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/Tailer.html
>>> [2] https://github.com/NFLabs/java-grok
>>> [3] http://www.anthonycorbacho.net/portfolio/grok-pattern/
>>> [4] http://dev.maxmind.com/geoip/legacy/geolite/#Downloads
>>> [5] http://dev.maxmind.com/geoip/legacy/csv/
>>> [6] http://redis.io/documentation
>>> [7] https://github.com/xetorthio/jedis
>>> 
>>> *Author and Instructor for the Upcoming Book and Lecture Series*
>>> *Massive Log Data Aggregation, Processing, Searching and Visualization with
>>> Open Source Software*
>>> *http://massivelogdata.com*
>> 
>>
