I like the idea of MapDB, since we can essentially pull an instance into each supervisor; it makes a lot of sense for relatively small-scale, relatively static enrichments in general.
Generally this feels like a caching problem, and it would be for a simple key-value lookup. In that case I would agree with David Lyle on using HBase as a source of truth and relying on caching. That said, GeoIP is a different lookup pattern, since it's a range lookup then a key lookup (or, if we denormalize the MaxMind data, just a range lookup). For that kind of thing MapDB with something like the BTree seems a good fit (rough sketches of what that could look like follow the quoted thread below).

Simon

> On 16 Jan 2017, at 16:28, David Lyle <dlyle65...@gmail.com> wrote:
>
> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an
> HBase enrichment. If our current caching isn't enough to mitigate the above
> issues, we have a problem, don't we? Or do we not recommend HBase
> enrichment for per-message enrichment in general?
>
> Also - can you elaborate on how MapDB would not require a network hop?
> Doesn't this mean we would have to sync the enrichment data to each Storm
> supervisor? HDFS could (probably would) have a network hop too, no?
>
> Fwiw -
> "In its place, I've looked at using MapDB, which is a really easy to use
> library for creating Java collections backed by a file (This is NOT a
> separate installation of anything, it's just a jar that manages interaction
> with the file system). Given the slow churn of the GeoIP files (I believe
> they get updated once a week), we can have a script that can be run when
> needed, downloads the MaxMind tar file, builds the MapDB file that will be
> used by the bolts, and places it into HDFS. Finally, we update a config to
> point to the new file, the bolts get the updated config callback and can
> update their db files. Inside the code, we wrap the MapDB portions to make
> it transparent to downstream code."
>
> Seems a bit more complex than "refresh the HBase table". Afaik, either
> approach would require some sort of translation between GeoIP source format
> and target format, so that part is a wash imo.
>
> So, I'd really like to see, at least, an attempt to leverage HBase
> enrichment.
>
> -D...
>
>
> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <ceste...@gmail.com> wrote:
>
>> I think that it's a sensible thing to use MapDB for the geo enrichment.
>> Let me state my reasoning:
>>
>>    - An HBase implementation would necessitate an HBase scan possibly
>>    hitting HDFS, which is expensive per-message.
>>    - An HBase implementation would necessitate a network hop and MapDB
>>    would not.
>>
>> I also think this might be the beginning of more general-purpose support
>> in Stellar for locally shipped, read-only MapDB lookups, which might be
>> interesting.
>>
>> In short, all quotes about premature optimization are sure to apply to my
>> reasoning, but I can't help but have my spidey senses tingle when we
>> introduce a scan-per-message architecture.
>>
>> Casey
>>
>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <dima.koval...@sstech.us>
>> wrote:
>>
>>>
>>> Hello Justin,
>>>
>>> Considering that Metron uses HBase tables for storing enrichment and
>>> threatintel feeds, can we use HBase for geo enrichment as well?
>>> Or can MapDB be used for enrichment and threatintel feeds instead of
>>> HBase?
>>>
>>> - Dima
>>>
>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
>>>> Hi all,
>>>>
>>>> As a bit of background, right now, GeoIP data is loaded into and
>>>> managed by MySQL (the connectors are LGPL licensed and we need to sever
>>>> our Maven dependency on it before next release).
>>>> We currently depend on and install an instance of MySQL (in each of the
>>>> Management Pack, Ansible, and Docker installs). In the topology, we use
>>>> the JDBCAdapter to connect to MySQL and query for a given IP.
>>>> Additionally, it's a single point of failure for that particular
>>>> enrichment right now. If MySQL is down, geo enrichment can't occur.
>>>>
>>>> I'm proposing that we eliminate the use of MySQL entirely, through all
>>>> installation paths (which, unless I missed some, includes Ansible, the
>>>> Ambari Management Pack, and Docker). We'd do this by dropping all the
>>>> various MySQL setup and management through the code, along with all the
>>>> DDL, etc. The JDBCAdapter would stay, so that anybody who wants to set
>>>> up their own databases for enrichments and install connectors is able
>>>> to do so.
>>>>
>>>> In its place, I've looked at using MapDB, which is a really easy to use
>>>> library for creating Java collections backed by a file (This is NOT a
>>>> separate installation of anything, it's just a jar that manages
>>>> interaction with the file system). Given the slow churn of the GeoIP
>>>> files (I believe they get updated once a week), we can have a script
>>>> that can be run when needed, downloads the MaxMind tar file, builds the
>>>> MapDB file that will be used by the bolts, and places it into HDFS.
>>>> Finally, we update a config to point to the new file, the bolts get the
>>>> updated config callback and can update their db files. Inside the code,
>>>> we wrap the MapDB portions to make it transparent to downstream code.
>>>>
>>>> The particularly nice parts about using MapDB are its ease of use, plus
>>>> the fact that it offers the utilities we need out of the box to support
>>>> the operations we need here (keep in mind the GeoIP files use IP
>>>> ranges, and we need to be able to easily grab the appropriate range).
>>>>
>>>> The main point of concern I have about this is that when we grab the
>>>> HDFS file during an update, given that multiple JVMs can be running, we
>>>> don't want them to clobber each other. I believe this can be avoided by
>>>> simply using each worker's working directory to store the file (and
>>>> appropriately ensuring that threads on the same JVM coordinate their
>>>> access). This should keep the JVMs (and the underlying DB files)
>>>> entirely independent.
>>>>
>>>> This script would get called by the various installations during
>>>> startup to do the initial setup. After install, it can then be called
>>>> on demand.
>>>>
>>>> At this point, we should be all set, with everything running and
>>>> updatable.
>>>>
>>>> Justin
>>>>
>>>
>>>
>>
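
To make the range-lookup idea concrete, here is a minimal sketch of what a MapDB-backed GeoIP lookup could look like. It assumes a BTreeMap keyed by the numeric start of each IP range; the map name ("geo"), the "rangeEnd|geoPayload" value layout, and the class/method names are hypothetical, not anything that exists in Metron today:

    import java.util.Map;
    import org.mapdb.BTreeMap;
    import org.mapdb.DB;
    import org.mapdb.DBMaker;
    import org.mapdb.Serializer;

    public class GeoRangeLookup {

        // Open a read-only MapDB file whose BTreeMap is keyed by the numeric
        // start of each IP range; the value packs "rangeEnd|geoPayload".
        static BTreeMap<Long, String> open(String path) {
            DB db = DBMaker.fileDB(path).readOnly().make();
            return db.treeMap("geo", Serializer.LONG, Serializer.STRING).open();
        }

        // Convert a dotted-quad IPv4 address to its numeric form.
        static long ipToLong(String ip) {
            long addr = 0;
            for (String octet : ip.split("\\.")) {
                addr = (addr << 8) | Integer.parseInt(octet);
            }
            return addr;
        }

        // Range lookup: floorEntry() finds the greatest range start <= the
        // address, then we check the address actually falls inside that range.
        static String lookup(BTreeMap<Long, String> ranges, String ip) {
            long addr = ipToLong(ip);
            Map.Entry<Long, String> candidate = ranges.floorEntry(addr);
            if (candidate == null) {
                return null;
            }
            String[] parts = candidate.getValue().split("\\|", 2);
            return addr <= Long.parseLong(parts[0]) ? parts[1] : null;
        }
    }

The per-worker refresh Justin describes (pull the new file out of HDFS into the worker's working directory on the config callback, then reopen) could look roughly like the following. Again, this is only a sketch under the same assumptions; the class name is made up, and real code would have to coordinate with in-flight readers before closing the old file:

    import java.io.File;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.mapdb.BTreeMap;
    import org.mapdb.DB;
    import org.mapdb.DBMaker;
    import org.mapdb.Serializer;

    public class GeoDbRefresher {
        private volatile BTreeMap<Long, String> ranges;
        private volatile DB db;

        public BTreeMap<Long, String> ranges() {
            return ranges;
        }

        // Called from the config-update callback with the HDFS path of the
        // newly built MapDB file.
        public synchronized void refresh(String hdfsPath) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            // A relative path lands in this worker's working directory, so
            // separate JVMs never clobber each other's copies; a timestamped
            // name avoids overwriting the file that is still open.
            File local = new File("geoip-" + System.currentTimeMillis() + ".mapdb");
            fs.copyToLocalFile(new Path(hdfsPath), new Path(local.getAbsolutePath()));

            DB oldDb = db;
            db = DBMaker.fileDB(local.getAbsolutePath()).readOnly().make();
            ranges = db.treeMap("geo", Serializer.LONG, Serializer.STRING).open();
            if (oldDb != null) {
                oldDb.close();  // real code would wait for in-flight readers first
            }
        }
    }

Either way, the bolt and Stellar code would only ever see the lookup call, which keeps the MapDB details transparent to downstream code, as Justin suggests.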