Hi,

Don't want to beat a dead horse, but I just stumbled upon this email. Note
how Israel wasn't aware of MorphlineInterceptor when he suggested
GrokInterceptor. I think that's because MorphlineInterceptor lives under the
org.apache.flume.sink.*solr*.... package.
See:
* http://search-hadoop.com/m/0jVep1J1hJL&subj=MorphlineInterceptor+questions
* http://search-hadoop.com/m/23imV1tSCQK1&subj=Questions+about+Morphline+Solr+Sink+structure

Otis
--
Performance Monitoring * Log Analytics * Search Analytics
Solr & Elasticsearch Support * http://sematext.com/


On Sat, Nov 16, 2013 at 5:26 AM, Wolfgang Hoschek <[email protected]> wrote:

> FYI, I've just added a new morphline command that returns Geolocation
> information for a given IP address, using an efficient in-memory Maxmind
> database lookup - https://issues.cloudera.org/browse/CDK-227
>
> This can then be used in the MorphlineInterceptor or Morphline Sink.
>
> Wolfgang.
>
> On Sep 7, 2013, at 8:47 PM, Israel Ekpo wrote:
>
> > Thank you everyone for your very constructive feedback. It was very
> > helpful.
> >
> > To provide some background, most of these suggestions have been
> > inspired by features I have found in Logstash [3].
> >
> > I am going to spend more time understanding how the CDK morphline
> > commands [4] work because I think they will really help with the
> > transformation utils needed in FileSource.
> >
> > Regarding the GrokInterceptor, I was not aware of the existence of
> > MorphlineInterceptor. It already does what I was proposing with
> > GrokInterceptor, so we are cool on that end.
> >
> > In simple standalone tests, the commons-io class that I am planning
> > to use for the FileSource handles file rotations well, but I have not
> > tested renames or removals yet.
> >
> > Regarding the GeoIPInterceptor, we can provide links for downloading
> > the Maxmind database separately without bundling the IP database with
> > Flume releases.
> >
> > This is how the Logstash project does it.
> >
> > Because of the large number of events expected, I was planning to use
> > Lucene because of the speed of executing range queries from trie
> > indexing [5], and the results can also be cached in memory if they
> > have been previously executed.
> >
> > I can perform some benchmarks with and without Lucene and see if the
> > performance differences justify using it for the lookups.
> >
> > My gut feeling is that using Lucene will lead to shorter processing
> > times as the volume of events increases.
> >
> > The RedisSource and RedisSink features will just be simple sources
> > and sinks. The sink will push [1] events to the Redis server and the
> > source will do a blocking pop [2] as it waits for new events to occur
> > on the Redis server.
> >
> > I am still trying out a few things; this part is not yet finalized.
> >
> > Regarding contributing features as plugins, how are plugins typically
> > contributed and managed?
> >
> > Do I have to create a GitHub repo and manage it independently, or are
> > they contributed as patches to the Flume project?
> >
> > [1] http://redis.io/commands/rpush
> > [2] http://redis.io/commands/blpop
> > [3] http://logstash.net/docs/1.2.1/
> > [4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> > [5] http://lucene.apache.org/core/4_4_0/core/org/apache/lucene/search/NumericRangeQuery.html
> >
> > *Author and Instructor for the Upcoming Book and Lecture Series*
> > *Massive Log Data Aggregation, Processing, Searching and Visualization
> > with Open Source Software*
> > *http://massivelogdata.com*
> >
> >
> > On Wed, Aug 28, 2013 at 1:21 PM, Wolfgang Hoschek <[email protected]> wrote:
> >
> >> Re: GrokInterceptor
> >>
> >> This functionality is already available in the form of the Apache
> >> Flume MorphlineInterceptor [1] with the grok command [2]. While grok
> >> is very useful, consider that grok alone often isn't enough - you
> >> typically need some other log event processing commands as well, for
> >> example as contained in morphlines [3].
> >>
> >> Re: FileSource
> >>
> >> True file tailing would be great.
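For context on what "true file tailing" involves, here is a minimal sketch of the core idea behind a polling tailer such as the commons-io Tailer Israel mentions. All names here are illustrative, not the actual Tailer API: remember the last read offset, read whatever was appended since, and treat a shrinking file as a rotation.

```java
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;

// Minimal polling tail loop: remembers the last read offset and, on each
// poll, returns the bytes appended since. A rotation/truncation is detected
// by the file shrinking below the saved offset, in which case we restart
// from the beginning of the new file.
public class SimpleTail {
    private long offset = 0;

    // Returns the text appended to the file since the previous call.
    public String poll(Path file) throws IOException {
        try (RandomAccessFile raf = new RandomAccessFile(file.toFile(), "r")) {
            if (raf.length() < offset) {
                offset = 0; // file was truncated or rotated; start over
            }
            raf.seek(offset);
            byte[] buf = new byte[(int) (raf.length() - offset)];
            raf.readFully(buf);
            offset = raf.length();
            return new String(buf, StandardCharsets.UTF_8);
        }
    }
}
```

A real source would additionally poll on a schedule, handle renames by file key rather than path, and hand completed lines to the channel; this sketch only shows the offset bookkeeping.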
> >>
> >> Merging multiple lines into one event can already be done with the
> >> MorphlineInterceptor with the readMultiLine command [4]. Or maybe
> >> embed a morphline directly into that new FileSource?
> >>
> >> Re: GeoIPInterceptor
> >>
> >> Seems to me that it would be more flexible, powerful and reusable to
> >> add this kind of functionality as a morphline command - contributions
> >> welcome!
> >>
> >> Finally, a word of caution: Maxmind is a good geo db, and I've used
> >> it before, but it has some LGPL issues that may or may not be
> >> workable in this context. The Maxmind db fits into RAM - Lucene seems
> >> like overkill here - you can do fast Maxmind lookups directly without
> >> Lucene.
> >>
> >> [1] http://flume.apache.org/FlumeUserGuide.html#morphline-interceptor
> >> [2] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#grok
> >> [3] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/index.html
> >> [4] http://cloudera.github.io/cdk/docs/0.6.0/cdk-morphlines/morphlinesReferenceGuide.html#readMultiLine
> >>
> >> Wolfgang.
> >>
> >>>
> >>> *FileSource*
> >>>
> >>> Using the Tailer feature from the Apache Commons I/O utility [1],
> >>> we can tail specific files for events.
> >>>
> >>> This allows us, regardless of the operating system, to watch files
> >>> for future events as they occur.
> >>>
> >>> It also allows us to step in and determine whether two or more
> >>> events should be merged into one event if newline characters are
> >>> present in an event.
> >>>
> >>> We can configure certain regular expressions that determine whether
> >>> a specific line is a new event or part of the previous event.
> >>>
> >>> Essentially, this source will have the ability to merge multiple
> >>> lines into one event before it is passed on to interceptors.
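The regex-based merge rule described above is easy to sketch. This is a hypothetical illustration, not a proposed Flume API; the start-of-event pattern (a leading timestamp) and the class/method names are mine.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Sketch of the merge rule: a line matching the "start of event" regex opens
// a new event; any other line (e.g. a stack-trace continuation) is appended
// to the previous event.
public class MultiLineMerger {
    // Illustrative start-of-event pattern: a line beginning with a timestamp
    // such as "2013-09-07 20:47:00".
    private static final Pattern EVENT_START =
            Pattern.compile("^\\d{4}-\\d{2}-\\d{2} \\d{2}:\\d{2}:\\d{2}");

    public static List<String> merge(List<String> lines) {
        List<String> events = new ArrayList<>();
        StringBuilder current = null;
        for (String line : lines) {
            if (EVENT_START.matcher(line).find() || current == null) {
                if (current != null) {
                    events.add(current.toString()); // close previous event
                }
                current = new StringBuilder(line);
            } else {
                current.append('\n').append(line); // continuation line
            }
        }
        if (current != null) {
            events.add(current.toString());
        }
        return events;
    }
}
```

This is essentially what readMultiLine expresses declaratively; the value of doing it in the source itself is that the merge happens before interceptors ever see the lines.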
> >>>
> >>> It has been complicated to group multiple lines into a single event
> >>> with the Spooling Directory Source or Exec Source. I tried creating
> >>> custom deserializers, but it was hard to get around the logic used
> >>> to parse the files.
> >>>
> >>> Using the Spooling Directory Source also means we cannot watch the
> >>> original files, so we need a background process to copy the log
> >>> files into the spooling directory, which requires additional setup.
> >>>
> >>> The tail command is also not available on all operating systems out
> >>> of the box.
> >>>
> >>>
> >>> *GrokInterceptor*
> >>>
> >>> With this interceptor we can parse semi-structured and unstructured
> >>> text and log data in the headers and body of the event into
> >>> something structured that can be easily queried.
> >>>
> >>> I plan to use the information in [2] and [3] for this.
> >>>
> >>> With this interceptor, we can extract HTTP response codes, response
> >>> times, user agents, IP addresses and a whole bunch of other useful
> >>> data points from free-form text.
> >>>
> >>>
> >>> *GeoIPInterceptor*
> >>>
> >>> This is for IP intelligence.
> >>>
> >>> This interceptor will allow us to use the value of an IP address in
> >>> the event header or body of the request to estimate the
> >>> geographical location of the IP address.
> >>>
> >>> Using the database available here [4], we can inject the two-letter
> >>> code or country name of the IP address into the event.
> >>>
> >>> We can also deduce other values such as city name, postal code,
> >>> latitude, longitude, Internet Service Provider and organization
> >>> name.
> >>>
> >>> This can be very helpful in analyzing traffic patterns and target
> >>> audience from webserver or application logs.
> >>>
> >>> The database is loaded into a Lucene index when the agent is
> >>> started up. The index is only created once, if it does not already
> >>> exist.
> >>>
> >>> As the interceptor comes across events, it maps the IP address to a
> >>> variety of values that can be injected into the events.
> >>>
> >>>
> >>> *RedisSink*
> >>>
> >>> This can provide another option for setting up a fan-in and/or
> >>> fan-out architecture.
> >>>
> >>> The RedisSink can serve as a queue that is used as a source by
> >>> another agent down the line.
> >>>
> >>> *References*
> >>> [1] http://commons.apache.org/proper/commons-io/javadocs/api-release/org/apache/commons/io/input/Tailer.html
> >>> [2] https://github.com/NFLabs/java-grok
> >>> [3] http://www.anthonycorbacho.net/portfolio/grok-pattern/
> >>> [4] http://dev.maxmind.com/geoip/legacy/geolite/#Downloads
> >>> [5] http://dev.maxmind.com/geoip/legacy/csv/
> >>> [6] http://redis.io/documentation
> >>> [7] https://github.com/xetorthio/jedis
> >>>
> >>> *Author and Instructor for the Upcoming Book and Lecture Series*
> >>> *Massive Log Data Aggregation, Processing, Searching and
> >>> Visualization with Open Source Software*
> >>> *http://massivelogdata.com*
> >>
> >>
> >
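As a footnote to Wolfgang's point that the Maxmind data fits into RAM and Lucene is likely overkill: the GeoLite CSV [5] is a list of (startIp, endIp, country) ranges, so a plain sorted map already gives O(log n) lookups with no index files at all. This is a sketch under that assumption, not the actual Maxmind API:

```java
import java.util.Map;
import java.util.TreeMap;

// Direct in-memory range lookup over GeoLite-style (startIp, endIp, country)
// rows, keyed by the numeric start of each range. floorEntry finds the
// candidate range in O(log n); we then check the IP is inside it.
public class GeoRangeLookup {
    private static final class Range {
        final long end;
        final String country;
        Range(long end, String country) { this.end = end; this.country = country; }
    }

    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    public void add(long startIp, long endIp, String country) {
        ranges.put(startIp, new Range(endIp, country));
    }

    // Converts dotted-quad IPv4 to the numeric form used in the GeoLite CSV.
    public static long toLong(String ip) {
        String[] p = ip.split("\\.");
        return (Long.parseLong(p[0]) << 24) | (Long.parseLong(p[1]) << 16)
             | (Long.parseLong(p[2]) << 8)  |  Long.parseLong(p[3]);
    }

    // Returns the country for the IP, or null if no range contains it.
    public String lookup(String ip) {
        long n = toLong(ip);
        Map.Entry<Long, Range> e = ranges.floorEntry(n);
        return (e != null && n <= e.getValue().end) ? e.getValue().country : null;
    }
}
```

Whether this beats a Lucene NumericRangeQuery at scale is exactly the benchmark Israel proposes; the point is only that the lookup itself needs nothing beyond a sorted structure.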
