I think that it's a sensible thing to use MapDB for the geo enrichment. Let me state my reasoning:
- An HBase implementation would necessitate an HBase scan, possibly hitting HDFS, which is expensive per-message.
- An HBase implementation would necessitate a network hop; MapDB would not.

I also think this might be the beginning of more general-purpose support in Stellar for locally shipped, read-only MapDB lookups, which might be interesting.

In short, all quotes about premature optimization are sure to apply to my reasoning, but I can't help but have my spidey senses tingle when we introduce a scan-per-message architecture.

Casey

On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <dima.koval...@sstech.us> wrote:

> Hello Justin,
>
> Considering that Metron uses HBase tables for storing enrichment and
> threatintel feeds, can we use HBase for geo enrichment as well? Or can
> MapDB be used for enrichment and threatintel feeds instead of HBase?
>
> - Dima
>
> On 01/16/2017 04:17 PM, Justin Leet wrote:
> > Hi all,
> >
> > As a bit of background, right now GeoIP data is loaded into and managed
> > by MySQL (the connectors are LGPL-licensed, and we need to sever our
> > Maven dependency on them before the next release). We currently depend
> > on and install an instance of MySQL (in each of the Management Pack,
> > Ansible, and Docker installs). In the topology, we use the JDBCAdapter
> > to connect to MySQL and query for a given IP. Additionally, it's a
> > single point of failure for that particular enrichment right now: if
> > MySQL is down, geo enrichment can't occur.
> >
> > I'm proposing that we eliminate the use of MySQL entirely, through all
> > installation paths (which, unless I missed some, includes Ansible, the
> > Ambari Management Pack, and Docker). We'd do this by dropping all the
> > various MySQL setup and management through the code, along with all the
> > DDL, etc. The JDBCAdapter would stay, so that anybody who wants to set
> > up their own databases for enrichments and install connectors is able
> > to do so.
> > In its place, I've looked at using MapDB, which is a really easy-to-use
> > library for creating Java collections backed by a file. (This is NOT a
> > separate installation of anything; it's just a jar that manages
> > interaction with the file system.) Given the slow churn of the GeoIP
> > files (I believe they get updated once a week), we can have a script
> > that can be run when needed: it downloads the MaxMind tar file, builds
> > the MapDB file that will be used by the bolts, and places it into HDFS.
> > Finally, we update a config to point to the new file, the bolts get the
> > updated config callback, and they can update their DB files. Inside the
> > code, we wrap the MapDB portions to make them transparent to downstream
> > code.
> >
> > The particularly nice parts about using MapDB are its ease of use, plus
> > the fact that it offers, out of the box, the utilities we need to
> > support the operations required here. (Keep in mind the GeoIP files use
> > IP ranges, and we need to be able to easily grab the appropriate range.)
> >
> > The main point of concern I have is that when we grab the HDFS file
> > during an update, given that multiple JVMs can be running, we don't
> > want them to clobber each other. I believe this can be avoided by
> > simply using each worker's working directory to store the file (and
> > appropriately ensuring that threads on the same JVM manage
> > multithreading). This should keep the JVMs (and the underlying DB
> > files) entirely independent.
> >
> > This script would get called by the various installations during
> > startup to do the initial setup. After install, it can then be called
> > on demand in order to update.
> >
> > At this point, we should be all set, with everything running and
> > updatable.
> >
> > Justin
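The range lookup discussed above (finding which IP range contains a given address) maps naturally onto NavigableMap semantics: key the map by range start, then take the floor entry for a given IP and confirm the IP falls before the range end. MapDB's BTreeMap supports this style of lookup, but the sketch below uses a plain `java.util.TreeMap` as a dependency-free stand-in; the class name, method names, and sample ranges are all hypothetical illustrations, not the actual Metron code.

```java
import java.util.Map;
import java.util.TreeMap;

public class GeoRangeLookup {

    // Holds the end of an IP range and the enrichment value attached to it.
    static final class Range {
        final long end;
        final String location;
        Range(long end, String location) { this.end = end; this.location = location; }
    }

    // Convert a dotted-quad IPv4 address to an unsigned 32-bit value in a long.
    static long ipToLong(String ip) {
        long value = 0;
        for (String octet : ip.split("\\.")) {
            value = (value << 8) | Integer.parseInt(octet);
        }
        return value;
    }

    // Find the range whose start is the greatest key <= ip, then confirm the
    // ip actually falls inside that range; returns null on a miss.
    static String lookup(TreeMap<Long, Range> ranges, long ip) {
        Map.Entry<Long, Range> candidate = ranges.floorEntry(ip);
        if (candidate != null && ip <= candidate.getValue().end) {
            return candidate.getValue().location;
        }
        return null;
    }

    // Toy data standing in for the MaxMind ranges (hypothetical values).
    static TreeMap<Long, Range> sampleRanges() {
        TreeMap<Long, Range> ranges = new TreeMap<>();
        ranges.put(ipToLong("10.0.0.0"), new Range(ipToLong("10.0.255.255"), "site-a"));
        ranges.put(ipToLong("192.168.0.0"), new Range(ipToLong("192.168.0.255"), "site-b"));
        return ranges;
    }

    public static void main(String[] args) {
        System.out.println(lookup(sampleRanges(), ipToLong("10.0.3.7"))); // site-a
    }
}
```

The same `floorEntry`-then-bounds-check pattern would apply unchanged if the `TreeMap` were swapped for a file-backed MapDB sorted map, which is what makes MapDB a good fit for this access pattern.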
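The clobber-avoidance idea near the end of the thread — each worker keeps its own copy of the DB file in its working directory and swaps a new version in without readers ever seeing a half-written file — can be sketched with plain `java.nio.file`. In the real design the published file would be pulled from HDFS; here a local `Path` stands in for it, and all names (`geo.mdb`, `refreshLocalCopy`) are hypothetical.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

public class LocalDbUpdater {

    /**
     * Copy a newly published DB file into a worker-private directory,
     * staging to a temp file and atomically renaming it into place so
     * readers never observe a partially written file.
     * Returns the path of the live per-worker copy.
     */
    static Path refreshLocalCopy(Path published, Path workerDir) throws IOException {
        Files.createDirectories(workerDir);
        Path tmp = workerDir.resolve("geo.mdb.tmp");
        Path live = workerDir.resolve("geo.mdb");
        // Stage the full copy first; only the rename below is visible to readers.
        Files.copy(published, tmp, StandardCopyOption.REPLACE_EXISTING);
        // Same-directory rename is all-or-nothing on filesystems that support it.
        Files.move(tmp, live, StandardCopyOption.ATOMIC_MOVE);
        return live;
    }
}
```

Because each worker writes only inside its own directory, JVMs stay entirely independent, matching the per-worker-working-directory approach proposed above; coordination is then only needed among threads within one JVM.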