Re: GrokInterceptor

This functionality is already available in the form of the Apache Flume MorphlineInterceptor [1] with the grok command [2]. While grok is very useful, consider that grok alone often isn't enough: you typically need some other log event processing commands as well, for example those contained in morphlines [3].

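For illustration, here is a rough, untested sketch of what such a morphline could look like. It assumes each event body is a single syslog-style text line. The morphline id, the dictionary path and the output field names are made up for the example, and the importCommands package prefix depends on the morphlines version you run, so please check the details against the reference guide [2] rather than taking this as a working config.

```
# Hypothetical morphline.conf: parse one text line per Flume event with grok
morphlines : [
  {
    # Assumed id; the interceptor config would refer to it by this name
    id : morphline1

    # Package prefix varies by morphlines version
    importCommands : ["com.cloudera.**"]

    commands : [
      # Read the event body as text into the "message" field
      { readLine { charset : UTF-8 } }

      # Extract named fields from the "message" field
      {
        grok {
          # Assumed location of the grok pattern dictionaries
          dictionaryFiles : ["/path/to/grok-dictionaries"]
          expressions : {
            message : """<%{POSINT:priority}>%{SYSLOGTIMESTAMP:timestamp} %{SYSLOGHOST:hostname} %{DATA:program}(?:\[%{POSINT:pid}\])?: %{GREEDYDATA:msg}"""
          }
        }
      }
    ]
  }
]
```

Further commands can be appended to the same commands list after grok, which is the point about grok alone often not being enough.
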
Re: FileSource

True file tailing would be great. Merging multiple lines into one event can already be done with the MorphlineInterceptor and the readMultiLine command [4]. Or maybe embed a morphline directly into that new FileSource?

Re: GeoIPInterceptor

Seems to me that it would be more flexible, powerful and reusable to add this kind of functionality as a morphline command. Contributions welcome!

Finally, a word of caution: Maxmind is a good geo db, and I've used it before, but it has some LGPL issues that may or may not be workable in this context. The Maxmind db fits into RAM, so Lucene seems like overkill here; you can do fast Maxmind lookups directly without Lucene.

[1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
[2] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#grok
[3] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
[4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#readMultiLine

Wolfgang.

>
> *FileSource*
>
> Using the Tailer feature from the Apache Commons I/O utility [1], we can tail
> specific files for events.
>
> This allows us to, regardless of the operating system, have the ability to
> watch files for future events as they occur.
>
> It also allows us to step in and determine if two or more events should be
> merged into one event if newline characters are present in an event.
>
> We can configure certain regular expressions that determine if a specific
> line is a new event or part of the previous event.
>
> Essentially, this source will have the ability to merge multiple lines into
> one event before it is passed on to interceptors.
>
> It has been complicated to group multiple lines into a single event with the
> Spooling Directory Source or Exec Source. I tried creating custom
> deserializers but it was hard to get around the logic used to parse the
> files.
>
> Using the Spooling Directory also means we cannot watch the original files
> so we need a background process to copy over the log files into the
> spooling directory which requires additional setup.
>
> The tail command is also not available on all operating systems out of the
> box.
>
>
> *GrokInterceptor*
>
> With this interceptor we can parse semi-structured and unstructured text and
> log data in the headers and body of the event into something structured
> that can be easily queried.
> I plan to use the information in [2] and [3] for this.
> With this interceptor, we can extract HTTP response codes, response times,
> user agents, IP addresses and a whole bunch of useful data points from free
> form text.
>
>
> *GeoIPInterceptor*
>
> This is for IP intelligence.
>
> This interceptor will allow us to use the value of an IP address in the
> event header or body of the request to estimate the geographical location
> of the IP address.
>
> Using the database available here [4], we can inject the two-letter code or
> country name of the IP address into the event.
>
> We can also deduce other values such as city name, postalCode, latitude,
> longitude, Internet Service Provider and Organization name.
>
> This can be very helpful in analyzing traffic patterns and target audience
> from webserver or application logs.
>
> The database is loaded into a Lucene index when the agent is started up.
> The index is only created once if it does not already exist.
>
> As the interceptor comes across events, it maps the IP address to a variety
> of values that can be injected into the events.
>
>
> *RedisSink*
>
> This can provide another option for setting up a fan-in and/or fan-out
> architecture.
>
> The RedisSink can serve as a queue that is used as a source by another
> agent down the line.
>
> *References*
> [1] http://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/Tailer.html
> [2] https://github.com/NFLabs/java-grok
> [3] http://www.anthonycorbacho.net/portfolio/grok-pattern/
> [4] http://dev.maxmind.com/geoip/legacy/geolite/#Downloads
> [5] http://dev.maxmind.com/geoip/legacy/csv/
> [6] http://redis.io/documentation
> [7] https://github.com/xetorthio/jedis
>
> *Author and Instructor for the Upcoming Book and Lecture Series*
> *Massive Log Data Aggregation, Processing, Searching and Visualization with
> Open Source Software*
> *http://massivelogdata.com*