FYI, I've just added a new morphline command that returns Geolocation information for a given IP address, using an efficient in-memory Maxmind database lookup - https://issues.cloudera.org/browse/CDK-227
This can then be used in the MorphlineInterceptor or Morphline Sink.

Wolfgang.

On Sep 7, 2013, at 8:47 PM, Israel Ekpo wrote:

> Thank you everyone for your very constructive feedback. It was very
> helpful.
>
> To provide some background, most of these suggestions have been inspired by
> features I have found in Logstash [3].
>
> I am going to spend more time understanding how the cdk morphline commands
> [4] work because I think it will really help with the transformation utils
> needed in FileSource.
>
> Regarding the GrokInterceptor, I was not aware of the existence of
> MorphlineInterceptor. It already does what I was proposing with
> GrokInterceptor. So we are cool from that end.
>
> In simple standalone tests, the commons-io class that I am planning to use
> for the FileSource handles file rotations well but I have not tested
> renames or removals yet.
>
> Regarding the GeoIPInterceptor we can provide links for downloading the
> Maxmind database separately without bundling the IP database with Flume
> releases.
>
> This is how the Logstash project does it.
>
> Because of the large number of events expected, I was planning to use
> Lucene because of the speed of executing range queries from trie indexing
> [5] and the results can also be cached in-memory if they have been
> previously executed.
>
> I can perform some benchmarks with and without Lucene and see if the
> performance differences justify using it for the lookups.
>
> My gut feeling is that using Lucene will lead to shorter processing times
> as the volume of events increases.
>
> The RedisSource and RedisSink features will just be simple sources and
> sinks. The sink will push [1] events to the Redis server and the source
> will do a blocking pop [2] as it waits for new events to occur on the Redis
> Server.
>
> I am still trying out a few things, this part is not yet finalized.
>
> Regarding contributing features as plugins, how are plugins typically
> contributed and managed?
>
> Do I have to create a github repo and manage it independently or are they
> contributed as patches to the Flume project?
>
> [1] http://redis.io/commands/rpush
> [2] http://redis.io/commands/blpop
> [3] http://logstash.net/docs/1.2.1/
> [4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> [5] http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/NumericRangeQuery.html
>
> *Author and Instructor for the Upcoming Book and Lecture Series*
> *Massive Log Data Aggregation, Processing, Searching and Visualization with
> Open Source Software*
> *http://massivelogdata.com*
>
>
> On Wed, Aug 28, 2013 at 1:21 PM, Wolfgang Hoschek
> <whosc...@cloudera.com> wrote:
>
>> Re: GrokInterceptor
>>
>> This functionality is already available in the form of the Apache Flume
>> MorphlineInterceptor [1] with the grok command [2]. While grok is very
>> useful, consider that grok alone often isn't enough - you typically need
>> some other log event processing commands as well, for example as contained
>> in morphlines [3].
>>
>> Re: FileSource
>>
>> True file tailing would be great.
>>
>> Merging multiple lines into one event can already be done with the
>> MorphlineInterceptor with the readMultiLine command [4]. Or maybe embed a
>> morphline directly into that new FileSource?
>>
>> Re: GeoIPInterceptor
>>
>> Seems to me that it would be more flexible, powerful and reusable to add
>> this kind of functionality as a morphline command - contributions welcome!
>>
>> Finally, a word of caution, Maxmind is a good geo db, and I've used it
>> before, but it has some LGPL issues that may or may not be workable in this
>> context. Maxmind db fits into RAM - Lucene seems like overkill here - you
>> can do fast maxmind lookups directly without Lucene.
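
To make the "lookups directly without Lucene" point concrete, here is a minimal sketch of an in-memory range lookup: sort the IP ranges once, then binary-search the range starts. It is not the Maxmind API or the morphline command; the rows, the placeholder codes "AA"/"BB"/"CC", and the names `build_index` and `lookup` are all hypothetical, and the addresses are reserved documentation ranges.

```python
import bisect
import ipaddress

# Hypothetical sample rows: (first IP, last IP, placeholder country code).
# Real rows would come from a separately downloaded GeoIP CSV file.
ROWS = [
    ("203.0.113.0", "203.0.113.255", "CC"),
    ("192.0.2.0", "192.0.2.255", "AA"),
    ("198.51.100.0", "198.51.100.255", "BB"),
]


def build_index(rows):
    """Convert rows to integers and sort them by the start of each range."""
    parsed = sorted(
        (int(ipaddress.IPv4Address(first)), int(ipaddress.IPv4Address(last)), code)
        for first, last, code in rows
    )
    starts = [p[0] for p in parsed]
    ends = [p[1] for p in parsed]
    codes = [p[2] for p in parsed]
    return starts, ends, codes


def lookup(index, ip):
    """Return the code of the range containing ip, or None if no range matches."""
    starts, ends, codes = index
    value = int(ipaddress.IPv4Address(ip))
    # Find the last range whose start is <= value.
    i = bisect.bisect_right(starts, value) - 1
    if i >= 0 and value <= ends[i]:
        return codes[i]
    return None


INDEX = build_index(ROWS)
```

Each lookup is a binary search over sorted lists held in RAM, which assumes the ranges do not overlap.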
>>
>> [1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
>> [2] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#grok
>> [3] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
>> [4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#readMultiLine
>>
>> Wolfgang.
>>
>>>
>>> *FileSource*
>>>
>>> Using the Tailer feature from Apache Commons I/O utility [1], we can tail
>>> specific files for events.
>>>
>>> This allows us to, regardless of the operating system, have the ability to
>>> watch files for future events as they occur.
>>>
>>> It also allows us to step in and determine if two or more events should be
>>> merged into one event if newline characters are present in an event.
>>>
>>> We can configure certain regular expressions that determine if a specific
>>> line is a new event or part of the previous event.
>>>
>>> Essentially, this source will have the ability to merge multiple lines into
>>> one event before it is passed on to interceptors.
>>>
>>> It has been complicated to group multiple lines into a single event with the
>>> Spooling Directory Source or Exec Source. I tried creating custom
>>> deserializers but it was hard to get around the logic used to parse the
>>> files.
>>>
>>> Using the Spooling Directory also means we cannot watch the original files
>>> so we need a background process to copy over the log files into the
>>> spooling directory which requires additional setup.
>>>
>>> The tail command is also not available on all operating systems out of the
>>> box.
>>>
>>>
>>> *GrokInterceptor*
>>>
>>> With this interceptor we can parse semi-structured and unstructured text and
>>> log data in the headers and body of the event into something structured
>>> that can be easily queried.
>>> I plan to use the information in [2] and [3] for this.
>>> With this interceptor, we can extract HTTP response codes, response times,
>>> user agents, IP addresses and a whole bunch of useful data points from free
>>> form text.
>>>
>>>
>>> *GeoIPInterceptor*
>>>
>>> This is for IP intelligence.
>>>
>>> This interceptor will allow us to use the value of an IP address in the
>>> event header or body of the request to estimate the geographical location
>>> of the IP address.
>>>
>>> Using the database available here [4], we can inject the two-letter code or
>>> country name of the IP address into the event.
>>>
>>> We can also deduce other values such as city name, postalCode, latitude,
>>> longitude, Internet Service Provider and Organization name.
>>>
>>> This can be very helpful in analyzing traffic patterns and target audience
>>> from webserver or application logs.
>>>
>>> The database is loaded into a Lucene index when the agent is started up.
>>> The index is only created once if it does not already exist.
>>>
>>> As the interceptor comes across events, it maps the IP address to a variety
>>> of values that can be injected into the events.
>>>
>>>
>>> *RedisSink*
>>>
>>> This can provide another option for setting up a fan-in and/or fan-out
>>> architecture.
>>>
>>> The RedisSink can serve as a queue that is used as a source by another
>>> agent down the line.
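
The queue behaviour described for the RedisSink and RedisSource (push at the tail, blocking pop from the head) can be sketched without a Redis server. The class below is an in-process stand-in built on Python's `queue.Queue`, not a Redis client; the name `InProcessList` is hypothetical. Its methods borrow the Redis command names but simplify them: the real BLPOP takes one or more keys plus a timeout and returns the key together with the element.

```python
import queue


class InProcessList:
    """In-process stand-in for one Redis list used as a FIFO queue."""

    def __init__(self):
        self._items = queue.Queue()

    def rpush(self, event):
        """Append an event at the tail, as a sink would."""
        self._items.put(event)

    def blpop(self, timeout):
        """Block up to `timeout` seconds for the head event; None if none arrives."""
        try:
            return self._items.get(timeout=timeout)
        except queue.Empty:
            return None


if __name__ == "__main__":
    channel = InProcessList()
    channel.rpush("event-1")
    channel.rpush("event-2")
    print(channel.blpop(timeout=0.1))
    print(channel.blpop(timeout=0.1))
```

Events come out in the order they were pushed, so an agent reading from the list sees them in the order the upstream agent wrote them.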
>>>
>>> *References*
>>> [1] http://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/Tailer.html
>>> [2] https://github.com/NFLabs/java-grok
>>> [3] http://www.anthonycorbacho.net/portfolio/grok-pattern/
>>> [4] http://dev.maxmind.com/geoip/legacy/geolite/#Downloads
>>> [5] http://dev.maxmind.com/geoip/legacy/csv/
>>> [6] http://redis.io/documentation
>>> [7] https://github.com/xetorthio/jedis
>>>
>>> *Author and Instructor for the Upcoming Book and Lecture Series*
>>> *Massive Log Data Aggregation, Processing, Searching and Visualization with
>>> Open Source Software*
>>> *http://massivelogdata.com*
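
The line-merging idea in the FileSource proposal (a regular expression decides whether a line starts a new event or continues the previous one) can be sketched as below. This is not the readMultiLine command or the Tailer API; the function `merge_lines` and the assumption that every new event begins with a `YYYY-MM-DD` date are hypothetical choices made for the illustration.

```python
import re

# Assumption for this sketch: a new event starts with a date like "2013-09-07 ".
NEW_EVENT = re.compile(r"^\d{4}-\d{2}-\d{2} ")


def merge_lines(lines, start_pattern=NEW_EVENT):
    """Yield events, folding continuation lines into the event before them."""
    current = []
    for line in lines:
        line = line.rstrip("\n")
        # A matching line closes the event collected so far and opens a new one.
        if start_pattern.match(line) and current:
            yield "\n".join(current)
            current = []
        current.append(line)
    if current:
        yield "\n".join(current)


if __name__ == "__main__":
    sample = [
        "2013-09-07 ERROR something failed",
        "  at Foo.bar(Foo.java:1)",
        "2013-09-07 INFO recovered",
    ]
    for event in merge_lines(sample):
        print(repr(event))
```

A stack trace then travels as a single event instead of one event per line.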