I think that it's a sensible thing to use MapDB for the geo enrichment. Let me state my reasoning:
- An HBase implementation would necessitate an HBase scan, possibly hitting HDFS, which is expensive per-message.
- An HBase implementation would necessitate a network hop; MapDB would not.

I also think this might be the beginning of more general-purpose support in Stellar for locally shipped, read-only MapDB lookups, which might be interesting.

In short, all quotes about premature optimization are sure to apply to my reasoning, but I can't help but have my spidey senses tingle when we introduce a scan-per-message architecture.

Casey

On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <dima.koval...@sstech.us> wrote:

> Hello Justin,
>
> Considering that Metron uses HBase tables for storing enrichment and
> threatintel feeds, can we use HBase for geo enrichment as well? Or can
> MapDB be used for enrichment and threatintel feeds instead of HBase?
>
> - Dima
>
> On 01/16/2017 04:17 PM, Justin Leet wrote:
> > Hi all,
> >
> > As a bit of background, right now GeoIP data is loaded into and managed
> > by MySQL (the connectors are LGPL-licensed, and we need to sever our
> > Maven dependency on them before the next release). We currently depend
> > on and install an instance of MySQL (in each of the Management Pack,
> > Ansible, and Docker installs). In the topology, we use the JDBCAdapter
> > to connect to MySQL and query for a given IP. Additionally, it's a
> > single point of failure for that particular enrichment right now: if
> > MySQL is down, geo enrichment can't occur.
> >
> > I'm proposing that we eliminate the use of MySQL entirely, through all
> > installation paths (which, unless I missed some, includes Ansible, the
> > Ambari Management Pack, and Docker). We'd do this by dropping all the
> > various MySQL setup and management through the code, along with all the
> > DDL, etc. The JDBCAdapter would stay, so that anybody who wants to set
> > up their own databases for enrichments and install connectors is able
> > to do so.
> > In its place, I've looked at using MapDB, which is a really easy-to-use
> > library for creating Java collections backed by a file. (This is NOT a
> > separate installation of anything; it's just a jar that manages
> > interaction with the file system.) Given the slow churn of the GeoIP
> > files (I believe they get updated once a week), we can have a script
> > that can be run when needed: it downloads the MaxMind tar file, builds
> > the MapDB file that will be used by the bolts, and places it into HDFS.
> > Finally, we update a config to point to the new file, the bolts get the
> > updated config callback, and they can update their DB files. Inside the
> > code, we wrap the MapDB portions to make them transparent to downstream
> > code.
> >
> > The particularly nice parts about using MapDB are its ease of use, plus
> > the fact that it offers, out of the box, the utilities we need to
> > support the operations required here. (Keep in mind the GeoIP files use
> > IP ranges, and we need to be able to easily grab the appropriate range.)
> >
> > The main point of concern I have is that when we grab the HDFS file
> > during an update, given that multiple JVMs can be running, we don't
> > want them to clobber each other. I believe this can be avoided by
> > simply using each worker's working directory to store the file (and
> > appropriately ensuring that threads on the same JVM manage
> > multithreading). This should keep the JVMs (and the underlying DB
> > files) entirely independent.
> >
> > This script would get called by the various installations during
> > startup to do the initial setup. After install, it can then be called
> > on demand in order to update.
> >
> > At this point, we should be all set, with everything running and
> > updatable.
> >
> > Justin
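The range lookup discussed above (finding which IP range contains a given address) maps naturally onto NavigableMap semantics: key the map by range start, then take the floor entry for a given IP and confirm the IP falls before the range end. MapDB's BTreeMap supports this style of lookup, but the sketch below uses a plain `java.util.TreeMap` as a dependency-free stand-in; the class name, method names, and sample ranges are all hypothetical illustrations, not the actual Metron code.

```java
import java.util.Map;
import java.util.TreeMap;

public class GeoRangeLookup {

    // Holds the end of an IP range and the enrichment value attached to it.
    static final class Range {
        final long end;
        final String location;
        Range(long end, String location) { this.end = end; this.location = location; }
    }

    // Convert a dotted-quad IPv4 address to an unsigned 32-bit value in a long.
    static long ipToLong(String ip) {
        long value = 0;
        for (String octet : ip.split("\\.")) {
            value = (value << 8) | Integer.parseInt(octet);
        }
        return value;
    }

    // Find the range whose start is the greatest key <= ip, then confirm the
    // ip actually falls inside that range; returns null on a miss.
    static String lookup(TreeMap<Long, Range> ranges, long ip) {
        Map.Entry<Long, Range> candidate = ranges.floorEntry(ip);
        if (candidate != null && ip <= candidate.getValue().end) {
            return candidate.getValue().location;
        }
        return null;
    }

    // Toy data standing in for the MaxMind ranges (hypothetical values).
    static TreeMap<Long, Range> sampleRanges() {
        TreeMap<Long, Range> ranges = new TreeMap<>();
        ranges.put(ipToLong("10.0.0.0"), new Range(ipToLong("10.0.255.255"), "site-a"));
        ranges.put(ipToLong("192.168.0.0"), new Range(ipToLong("192.168.0.255"), "site-b"));
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(lookup(sampleRanges(), ipToLong("10.0.3.7"))); // site-a
    }
}
```

The same `floorEntry`-then-bounds-check pattern would apply unchanged if the `TreeMap` were swapped for a file-backed MapDB sorted map, which is what makes MapDB a good fit for this access pattern.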
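The clobber-avoidance idea near the end of the thread — each worker keeps its own copy of the DB file in its working directory and swaps a new version in without readers ever seeing a half-written file — can be sketched with plain `java.nio.file`. In the real design the published file would be pulled from HDFS; here a local `Path` stands in for it, and all names (`geo.mdb`, `refreshLocalCopy`) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalDbUpdater {

    /**
     * Copy a newly published DB file into a worker-private directory,
     * staging to a temp file and atomically renaming it into place so
     * readers never observe a partially written file.
     * Returns the path of the live per-worker copy.
     */
    static Path refreshLocalCopy(Path published, Path workerDir) throws IOException {
        Files.createDirectories(workerDir);
        Path tmp = workerDir.resolve("geo.mdb.tmp");
        Path live = workerDir.resolve("geo.mdb");
        // Stage the full copy first; only the rename below is visible to readers.
        Files.copy(published, tmp, StandardCopyOption.REPLACE_EXISTING);
        // Same-directory rename is all-or-nothing on filesystems that support it.
        Files.move(tmp, live, StandardCopyOption.ATOMIC_MOVE);
        return live;
    }
}
```

Because each worker writes only inside its own directory, JVMs stay entirely independent, matching the per-worker-working-directory approach proposed above; coordination is then only needed among threads within one JVM.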