> Re: FileSource
>
> True file tailing would be great.

A pure Java implementation is not possible... It needs to be aware of inodes to be reliable.

Mike

> Merging multiple lines into one event can already be done with the
> MorphlineInterceptor with the readMultiLine command [4]. Or maybe embed a
> morphline directly into that new FileSource?
>
> Re: GeoIPInterceptor
>
> Seems to me that it would be more flexible, powerful and reusable to add this
> kind of functionality as a morphline command - contributions welcome!
>
> Finally, a word of caution: Maxmind is a good geo db, and I've used it
> before, but it has some LGPL issues that may or may not be workable in this
> context. Maxmind db fits into RAM - Lucene seems like overkill here - you can
> do fast Maxmind lookups directly without Lucene.
>
> [1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
> [2] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#grok
> [3] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> [4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#readMultiLine
>
> Wolfgang.
>
>>
>> *FileSource*
>>
>> Using the Tailer class from the Apache Commons IO library [1], we can tail
>> specific files for events.
>>
>> This allows us to watch files for future events as they occur, regardless
>> of the operating system.
>>
>> It also allows us to step in and determine whether two or more lines should
>> be merged into one event when an event contains newline characters.
>>
>> We can configure regular expressions that determine whether a specific
>> line starts a new event or belongs to the previous event.
>>
>> Essentially, this source will have the ability to merge multiple lines into
>> one event before it is passed on to interceptors.
>>
>> It has been complicated to group multiple lines into a single event with the
>> Spooling Directory Source or Exec Source. I tried creating custom
>> deserializers but it was hard to get around the logic used to parse the
>> files.
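
To make the regular-expression idea for FileSource concrete, here is a minimal sketch in plain Java. It is not the proposed source and it is not Flume or Commons IO code: MultiLineMerger, its merge method, and whatever pattern you pass in are hypothetical names chosen for illustration. It assumes that a line matching the configured pattern starts a new event and that every other line continues the previous one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of Flume or Commons IO. */
class MultiLineMerger {
    // Assumption: a line that matches this pattern begins a new event.
    private final Pattern eventStart;

    MultiLineMerger(String eventStartRegex) {
        this.eventStart = Pattern.compile(eventStartRegex);
    }

    /**
     * Groups raw lines into events. A line matching eventStart opens a new
     * event; any other line is appended to the current event with a newline.
     */
    List<String> merge(List<String> lines) {
        List<String> events = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (current == null || eventStart.matcher(line).find()) {
                if (current != null) {
                    events.add(current.toString());
                }
                current = new StringBuilder(line);
            } else {
                current.append('\n').append(line);
            }
        }
        if (current != null) {
            events.add(current.toString());
        }
        return events;
    }
}
```

With a pattern such as "^\\d{4}-\\d{2}-\\d{2} ", a stack trace that follows a dated log line ends up in the same event as that line. A source that tails a live file would also need a timeout or similar rule to flush the last event, since it cannot know whether more continuation lines are coming; this sketch only handles a complete list of lines.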
>>
>> Using the Spooling Directory Source also means we cannot watch the original
>> files, so we need a background process to copy the log files into the
>> spooling directory, which requires additional setup.
>>
>> The tail command is also not available on all operating systems out of the
>> box.
>>
>>
>> *GrokInterceptor*
>>
>> With this interceptor we can parse semi-structured and unstructured text and
>> log data in the headers and body of the event into something structured
>> that can be easily queried.
>> I plan to use the information in [2] and [3] for this.
>> With this interceptor, we can extract HTTP response codes, response times,
>> user agents, IP addresses and a whole bunch of useful data points from
>> free-form text.
>>
>>
>>
>> *GeoIPInterceptor*
>>
>> This is for IP intelligence.
>>
>> This interceptor will allow us to use the value of an IP address in the
>> event header or body to estimate the geographical location of the IP
>> address.
>>
>> Using the database available here [4], we can inject the two-letter country
>> code or country name for the IP address into the event.
>>
>> We can also deduce other values such as city name, postal code, latitude,
>> longitude, Internet Service Provider and organization name.
>>
>> This can be very helpful in analyzing traffic patterns and target audience
>> from webserver or application logs.
>>
>> The database is loaded into a Lucene index when the agent is started up.
>> The index is created only once, if it does not already exist.
>>
>> As the interceptor comes across events, it maps the IP address to a variety
>> of values that can be injected into the events.
>>
>>
>>
>> *RedisSink*
>>
>> This can provide another option for setting up a fan-in and/or fan-out
>> architecture.
>>
>> The RedisSink can serve as a queue that is used as a source by another
>> agent down the line.
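
For the GrokInterceptor idea above, the core step (turning free-form text into named fields that could become event headers) can be sketched with named capture groups from java.util.regex. This is only an illustration under stated assumptions: it does not use java-grok or the morphline grok command, and AccessLogFields, the field names, and the one-line log format are all made up for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical illustration; not the java-grok or morphline API. */
class AccessLogFields {
    // Assumed input shape: <ip> "<METHOD> <path>" <status> <millis>
    private static final Pattern LINE = Pattern.compile(
        "^(?<clientip>\\d{1,3}(?:\\.\\d{1,3}){3}) "
        + "\"(?<verb>[A-Z]+) (?<request>\\S+)\" "
        + "(?<response>\\d{3}) (?<millis>\\d+)$");

    private static final String[] FIELDS =
        {"clientip", "verb", "request", "response", "millis"};

    /**
     * Returns the extracted fields in declaration order, or an empty map
     * when the body does not match the assumed format.
     */
    static Map<String, String> extract(String body) {
        Map<String, String> fields = new LinkedHashMap<>();
        Matcher m = LINE.matcher(body);
        if (m.matches()) {
            for (String name : FIELDS) {
                fields.put(name, m.group(name));
            }
        }
        return fields;
    }
}
```

An interceptor built this way would copy the returned map into the event headers. A real grok library adds a reusable pattern dictionary on top of this, which is the main reason to prefer one over hand-written expressions.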
>>
>> *References*
>> [1] http://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/Tailer.html
>> [2] https://github.com/NFLabs/java-grok
>> [3] http://www.anthonycorbacho.net/portfolio/grok-pattern/
>> [4] http://dev.maxmind.com/geoip/legacy/geolite/#Downloads
>> [5] http://dev.maxmind.com/geoip/legacy/csv/
>> [6] http://redis.io/documentation
>> [7] https://github.com/xetorthio/jedis
>>
>> *Author and Instructor for the Upcoming Book and Lecture Series*
>> *Massive Log Data Aggregation, Processing, Searching and Visualization with
>> Open Source Software*
>> *http://massivelogdata.com*
>
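
On the point that the geo database fits into RAM: below is a minimal sketch of an in-memory range lookup that needs no Lucene index. It assumes the data has already been reduced to non-overlapping (start IP, end IP, country code) ranges, for example from a CSV export, and it handles IPv4 only. InMemoryCountryLookup and its methods are hypothetical names; this is not the MaxMind API.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration; not the MaxMind API. IPv4 only. */
class InMemoryCountryLookup {
    private static final class Range {
        final long end;
        final String country;

        Range(long end, String country) {
            this.end = end;
            this.country = country;
        }
    }

    // Ranges keyed by numeric start address. Assumption: ranges do not overlap.
    private final TreeMap<Long, Range> byStart = new TreeMap<>();

    void addRange(String startIp, String endIp, String country) {
        byStart.put(toLong(startIp), new Range(toLong(endIp), country));
    }

    /** Returns the country code for the address, or null if no range covers it. */
    String lookup(String ip) {
        long value = toLong(ip);
        // Closest range that starts at or before the address.
        Map.Entry<Long, Range> candidate = byStart.floorEntry(value);
        if (candidate != null && value <= candidate.getValue().end) {
            return candidate.getValue().country;
        }
        return null;
    }

    /** Converts a dotted-quad IPv4 string to its numeric value. */
    static long toLong(String dottedQuad) {
        String[] parts = dottedQuad.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + dottedQuad);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Not an IPv4 address: " + dottedQuad);
            }
            value = (value << 8) | octet;
        }
        return value;
    }
}
```

Each lookup is a single floorEntry call on a sorted map, so it costs O(log n) in the number of ranges. Whether that is fast enough for a given event rate would still need to be measured.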
