Hi,

Don't want to beat a dead horse, but I just stumbled upon this email. Note
how Israel wasn't aware of MorphlineInterceptor when he suggested
GrokInterceptor. I think that's because MorphlineInterceptor lives under the
org.apache.flume.sink.*solr*.... package.
See:
* http://search-hadoop.com/m/0jVep1J1hJL&subj=MorphlineInterceptor+questions
* http://search-hadoop.com/m/23imV1tSCQK1&subj=Questions+about+Morphline+Solr+Sink+structure

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Sat, Nov 16, 2013 at 5:26 AM, Wolfgang Hoschek <[email protected]> wrote:

> FYI, I've just added a new morphline command that returns Geolocation
> information for a given IP address, using an efficient in-memory Maxmind
> database lookup - https://issues.cloudera.org/browse/CDK-227
>
> This can then be used in the MorphlineInterceptor or Morphline Sink.
>
> Wolfgang.
>
> On Sep 7, 2013, at 8:47 PM, Israel Ekpo wrote:
>
> > Thank you everyone for your very constructive feedback. It was very
> > helpful.
> >
> > To provide some background, most of these suggestions have been
> > inspired by features I have found in Logstash [3].
> >
> > I am going to spend more time understanding how the CDK morphline
> > commands [4] work because I think they will really help with the
> > transformation utils needed in FileSource.
> >
> > Regarding the GrokInterceptor, I was not aware of the existence of
> > MorphlineInterceptor. It already does what I was proposing with
> > GrokInterceptor, so we are cool on that end.
> >
> > In simple standalone tests, the commons-io class that I am planning
> > to use for the FileSource handles file rotations well, but I have not
> > tested renames or removals yet.
> >
> > Regarding the GeoIPInterceptor, we can provide links for downloading
> > the Maxmind database separately without bundling the IP database with
> > Flume releases.
> >
> > This is how the Logstash project does it.
> >
> > Because of the large number of events expected, I was planning to use
> > Lucene because of the speed of executing range queries from trie
> > indexing [5], and the results can also be cached in memory if they
> > have been previously executed.
> >
> > I can perform some benchmarks with and without Lucene and see if the
> > performance differences justify using it for the lookups.
> >
> > My gut feeling is that using Lucene will lead to shorter processing
> > times as the volume of events increases.
> >
> > The RedisSource and RedisSink features will just be simple sources
> > and sinks. The sink will push [1] events to the Redis server and the
> > source will do a blocking pop [2] as it waits for new events to occur
> > on the Redis server.
> >
> > I am still trying out a few things; this part is not yet finalized.
> >
> > Regarding contributing features as plugins, how are plugins typically
> > contributed and managed?
> >
> > Do I have to create a GitHub repo and manage it independently, or are
> > they contributed as patches to the Flume project?
> >
> > [1] http://redis.io/commands/rpush
> > [2] http://redis.io/commands/blpop
> > [3] http://logstash.net/docs/1.2.1/
> > [4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> > [5] http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/NumericRangeQuery.html
> >
> > *Author and Instructor for the Upcoming Book and Lecture Series*
> > *Massive Log Data Aggregation, Processing, Searching and Visualization
> > with Open Source Software*
> > *http://massivelogdata.com*
> >
> >
> > On Wed, Aug 28, 2013 at 1:21 PM, Wolfgang Hoschek <[email protected]> wrote:
> >
> >> Re: GrokInterceptor
> >>
> >> This functionality is already available in the form of the Apache
> >> Flume MorphlineInterceptor [1] with the grok command [2]. While grok
> >> is very useful, consider that grok alone often isn't enough - you
> >> typically need some other log event processing commands as well, for
> >> example as contained in morphlines [3].
> >>
> >> Re: FileSource
> >>
> >> True file tailing would be great.
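For context on what "true file tailing" involves, here is a minimal sketch of the core idea behind a polling tailer such as the commons-io Tailer Israel mentions. All names here are illustrative, not the actual Tailer API: remember the last read offset, read whatever was appended since, and treat a shrinking file as a rotation.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Minimal polling tail loop: remembers the last read offset and, on each
// poll, returns the bytes appended since. A rotation/truncation is detected
// by the file shrinking below the saved offset, in which case we restart
// from the beginning of the new file.
public class SimpleTail {
    private long offset = 0;

    // Returns the text appended to the file since the previous call.
    public String poll(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (raf.length() < offset) {
                offset = 0; // file was truncated or rotated; start over
            }
            raf.seek(offset);
            byte[] buf = new byte[(int) (raf.length() - offset)];
            raf.readFully(buf);
            offset = raf.length();
            return new String(buf, StandardCharsets.UTF_8);
        }
    }
}
```

A real source would additionally poll on a schedule, handle renames by file key rather than path, and hand completed lines to the channel; this sketch only shows the offset bookkeeping.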
> >>
> >> Merging multiple lines into one event can already be done with the
> >> MorphlineInterceptor with the readMultiLine command [4]. Or maybe
> >> embed a morphline directly into that new FileSource?
> >>
> >> Re: GeoIPInterceptor
> >>
> >> Seems to me that it would be more flexible, powerful and reusable to
> >> add this kind of functionality as a morphline command - contributions
> >> welcome!
> >>
> >> Finally, a word of caution: Maxmind is a good geo db, and I've used
> >> it before, but it has some LGPL issues that may or may not be
> >> workable in this context. The Maxmind db fits into RAM - Lucene seems
> >> like overkill here - you can do fast Maxmind lookups directly without
> >> Lucene.
> >>
> >> [1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
> >> [2] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#grok
> >> [3] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> >> [4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#readMultiLine
> >>
> >> Wolfgang.
> >>
> >>>
> >>> *FileSource*
> >>>
> >>> Using the Tailer feature from the Apache Commons I/O utility [1],
> >>> we can tail specific files for events.
> >>>
> >>> This allows us, regardless of the operating system, to watch files
> >>> for future events as they occur.
> >>>
> >>> It also allows us to step in and determine whether two or more
> >>> events should be merged into one event if newline characters are
> >>> present in an event.
> >>>
> >>> We can configure certain regular expressions that determine whether
> >>> a specific line is a new event or part of the previous event.
> >>>
> >>> Essentially, this source will have the ability to merge multiple
> >>> lines into one event before it is passed on to interceptors.
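The regex-based merge rule described above is easy to sketch. This is a hypothetical illustration, not a proposed Flume API; the start-of-event pattern (a leading timestamp) and the class/method names are mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of the merge rule: a line matching the "start of event" regex opens
// a new event; any other line (e.g. a stack-trace continuation) is appended
// to the previous event.
public class MultiLineMerger {
    // Illustrative start-of-event pattern: a line beginning with a timestamp
    // such as "2013-09-07 20:47:00".
    private static final Pattern EVENT_START =
            Pattern.compile("^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}");

    public static List<String> merge(List<String> lines) {
        List<String> events = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (EVENT_START.matcher(line).find() || current == null) {
                if (current != null) {
                    events.add(current.toString()); // close previous event
                }
                current = new StringBuilder(line);
            } else {
                current.append('\n').append(line); // continuation line
            }
        }
        if (current != null) {
            events.add(current.toString());
        }
        return events;
    }
}
```

This is essentially what readMultiLine expresses declaratively; the value of doing it in the source itself is that the merge happens before interceptors ever see the lines.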
> >>>
> >>> It has been complicated to group multiple lines into a single event
> >>> with the Spooling Directory Source or Exec Source. I tried creating
> >>> custom deserializers, but it was hard to get around the logic used
> >>> to parse the files.
> >>>
> >>> Using the Spooling Directory Source also means we cannot watch the
> >>> original files, so we need a background process to copy the log
> >>> files into the spooling directory, which requires additional setup.
> >>>
> >>> The tail command is also not available on all operating systems out
> >>> of the box.
> >>>
> >>>
> >>> *GrokInterceptor*
> >>>
> >>> With this interceptor we can parse semi-structured and unstructured
> >>> text and log data in the headers and body of the event into
> >>> something structured that can be easily queried.
> >>>
> >>> I plan to use the information in [2] and [3] for this.
> >>>
> >>> With this interceptor, we can extract HTTP response codes, response
> >>> times, user agents, IP addresses and a whole bunch of other useful
> >>> data points from free-form text.
> >>>
> >>>
> >>> *GeoIPInterceptor*
> >>>
> >>> This is for IP intelligence.
> >>>
> >>> This interceptor will allow us to use the value of an IP address in
> >>> the event header or body of the request to estimate the
> >>> geographical location of the IP address.
> >>>
> >>> Using the database available here [4], we can inject the two-letter
> >>> code or country name of the IP address into the event.
> >>>
> >>> We can also deduce other values such as city name, postal code,
> >>> latitude, longitude, Internet Service Provider and organization
> >>> name.
> >>>
> >>> This can be very helpful in analyzing traffic patterns and target
> >>> audience from webserver or application logs.
> >>>
> >>> The database is loaded into a Lucene index when the agent is
> >>> started up. The index is only created once, if it does not already
> >>> exist.
> >>>
> >>> As the interceptor comes across events, it maps the IP address to a
> >>> variety of values that can be injected into the events.
> >>>
> >>>
> >>> *RedisSink*
> >>>
> >>> This can provide another option for setting up a fan-in and/or
> >>> fan-out architecture.
> >>>
> >>> The RedisSink can serve as a queue that is used as a source by
> >>> another agent down the line.
> >>>
> >>> *References*
> >>> [1] http://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/Tailer.html
> >>> [2] https://github.com/NFLabs/java-grok
> >>> [3] http://www.anthonycorbacho.net/portfolio/grok-pattern/
> >>> [4] http://dev.maxmind.com/geoip/legacy/geolite/#Downloads
> >>> [5] http://dev.maxmind.com/geoip/legacy/csv/
> >>> [6] http://redis.io/documentation
> >>> [7] https://github.com/xetorthio/jedis
> >>>
> >>> *Author and Instructor for the Upcoming Book and Lecture Series*
> >>> *Massive Log Data Aggregation, Processing, Searching and
> >>> Visualization with Open Source Software*
> >>> *http://massivelogdata.com*
> >>
> >>
> >
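As a footnote to Wolfgang's point that the Maxmind data fits into RAM and Lucene is likely overkill: the GeoLite CSV [5] is a list of (startIp, endIp, country) ranges, so a plain sorted map already gives O(log n) lookups with no index files at all. This is a sketch under that assumption, not the actual Maxmind API:

```java
import java.util.Map;
import java.util.TreeMap;

// Direct in-memory range lookup over GeoLite-style (startIp, endIp, country)
// rows, keyed by the numeric start of each range. floorEntry finds the
// candidate range in O(log n); we then check the IP is inside it.
public class GeoRangeLookup {
    private static final class Range {
        final long end;
        final String country;
        Range(long end, String country) { this.end = end; this.country = country; }
    }

    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    public void add(long startIp, long endIp, String country) {
        ranges.put(startIp, new Range(endIp, country));
    }

    // Converts dotted-quad IPv4 to the numeric form used in the GeoLite CSV.
    public static long toLong(String ip) {
        String[] p = ip.split("\\.");
        return (Long.parseLong(p[0]) << 24) | (Long.parseLong(p[1]) << 16)
             | (Long.parseLong(p[2]) << 8)  |  Long.parseLong(p[3]);
    }

    // Returns the country for the IP, or null if no range contains it.
    public String lookup(String ip) {
        long n = toLong(ip);
        Map.Entry<Long, Range> e = ranges.floorEntry(n);
        return (e != null && n <= e.getValue().end) ? e.getValue().country : null;
    }
}
```

Whether this beats a Lucene NumericRangeQuery at scale is exactly the benchmark Israel proposes; the point is only that the lookup itself needs nothing beyond a sorted structure.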
