Hello everyone,

I think it would be helpful to have the following features in Apache Flume: FileSource, GrokInterceptor, GeoIPInterceptor, and RedisSink.
I plan to open JIRA issues for these proposals tonight. I am about to start creating patches for some of them, but I would like to know what you guys think so that I can adjust my approach before going too far.

*When you get a chance, please take a look at them and give me some feedback.* Thanks.

*FileSource*
Using the Tailer class from the Apache Commons IO library [1], we can tail specific files for events. This lets us watch files for new events as they occur, regardless of the operating system. It also lets us step in and decide whether two or more lines should be merged into one event when an event contains newline characters. We can configure regular expressions that determine whether a given line starts a new event or belongs to the previous event. Essentially, this source will be able to merge multiple lines into one event before the event is passed on to interceptors.

It has been complicated to group multiple lines into a single event with the Spooling Directory Source or the Exec Source. I tried creating custom deserializers, but it was hard to get around the logic used to parse the files. Using the Spooling Directory Source also means we cannot watch the original files, so we need a background process that copies the log files into the spooling directory, which requires additional setup. The tail command is also not available out of the box on all operating systems.

*GrokInterceptor*
With this interceptor we can parse semi-structured and unstructured text and log data in the headers and body of the event into something structured that can be easily queried. I plan to use the information in [2] and [3] for this. With this interceptor, we can extract HTTP response codes, response times, user agents, IP addresses and many other useful data points from free-form text.

*GeoIPInterceptor*
This is for IP intelligence.
This interceptor will allow us to use an IP address found in the event header or body to estimate the geographical location of that address. Using the database available at [4], we can inject the two-letter country code or the country name for the IP address into the event. We can also derive other values such as city name, postal code, latitude, longitude, Internet Service Provider and organization name. This can be very helpful in analyzing traffic patterns and target audience from web server or application logs.

The database is loaded into a Lucene index when the agent starts up. The index is created only once, if it does not already exist. As the interceptor comes across events, it maps the IP address to a variety of values that can be injected into the events.

*RedisSink*
This can provide another option for setting up a fan-in and/or fan-out architecture. The RedisSink can feed a queue that another agent down the line uses as a source.

*References*
[1] http://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/Tailer.html
[2] https://github.com/NFLabs/java-grok
[3] http://www.anthonycorbacho.net/portfolio/grok-pattern/
[4] http://dev.maxmind.com/geoip/legacy/geolite/#Downloads
[5] http://dev.maxmind.com/geoip/legacy/csv/
[6] http://redis.io/documentation
[7] https://github.com/xetorthio/jedis

*Author and Instructor for the Upcoming Book and Lecture Series*
*Massive Log Data Aggregation, Processing, Searching and Visualization with Open Source Software*
*http://massivelogdata.com*
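
As a rough illustration of the multi-line merging described under FileSource, the rule "a line either starts a new event or belongs to the previous one" could be sketched as below. This is only a minimal sketch under stated assumptions, not the proposed patch: the class name `MultiLineMerger`, the method `merge`, and the timestamp pattern are hypothetical, and it works on an in-memory list of lines instead of lines delivered by a Tailer listener.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper: groups raw lines into events using a "new event" regex.
public class MultiLineMerger {
    private final Pattern newEventPattern;

    // newEventRegex is an assumed configuration value, e.g. a leading timestamp.
    public MultiLineMerger(String newEventRegex) {
        this.newEventPattern = Pattern.compile(newEventRegex);
    }

    // A line matching the pattern opens a new event; any other line is
    // appended to the event currently being built, separated by a newline.
    public List<String> merge(List<String> lines) {
        List<String> events = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            boolean startsNewEvent = newEventPattern.matcher(line).find();
            if (startsNewEvent || current == null) {
                if (current != null) {
                    events.add(current.toString());
                }
                current = new StringBuilder(line);
            } else {
                current.append('\n').append(line);
            }
        }
        if (current != null) {
            events.add(current.toString());
        }
        return events;
    }

    public static void main(String[] args) {
        // Assumed log format: each event begins with a yyyy-MM-dd date.
        MultiLineMerger merger = new MultiLineMerger("^\\d{4}-\\d{2}-\\d{2} ");
        List<String> events = merger.merge(Arrays.asList(
                "2014-01-01 ERROR request failed",
                "  at Foo.bar(Foo.java:1)",
                "  at Baz.qux(Baz.java:2)",
                "2014-01-01 INFO recovered"));
        for (String event : events) {
            System.out.println("--- event ---");
            System.out.println(event);
        }
    }
}
```

In a real source the final event would also need a flush rule (for example a timeout), since a tailed file gives no signal that the last event is complete; the sketch simply flushes at the end of the list.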
