+1 to the point about this being a different lookup pattern. If this could be done in a multi-get, I'd be all for HBase, but I worry about scan performance, a historical sore point for HBase architectures.
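It's worth making concrete why a multi-get doesn't fit: GeoIP is a range-then-key lookup, not a point lookup. A minimal sketch of that pattern using a plain `NavigableMap` (MapDB's file-backed BTreeMap exposes a similar navigable-map API; the class, data, and helper below are purely illustrative, not anything in Metron):

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of the range-then-key lookup GeoIP needs: store each block's start
// address as the key, then floorEntry() finds the candidate covering range.
// A plain in-memory TreeMap stands in here for a file-backed BTree.
public class GeoRangeLookupSketch {
    // Convert a dotted-quad IPv4 address to a long so ranges sort numerically.
    static long ipToLong(String ip) {
        long result = 0;
        for (String octet : ip.split("\\.")) {
            result = (result << 8) | Integer.parseInt(octet);
        }
        return result;
    }

    public static void main(String[] args) {
        // Keys are range starts; values carry the range end plus the geo data.
        NavigableMap<Long, String[]> ranges = new TreeMap<>();
        ranges.put(ipToLong("10.0.0.0"),    new String[]{"10.0.0.255",    "site-A"});
        ranges.put(ipToLong("192.168.1.0"), new String[]{"192.168.1.255", "site-B"});

        long ip = ipToLong("192.168.1.42");
        Map.Entry<Long, String[]> e = ranges.floorEntry(ip);
        // Confirm the candidate range actually covers the address.
        if (e != null && ip <= ipToLong(e.getValue()[0])) {
            System.out.println(e.getValue()[1]); // prints site-B
        } else {
            System.out.println("no match");
        }
    }
}
```

A multi-get needs the exact keys up front; here the key (the covering range) is only discoverable by ordered traversal, which in HBase terms means a scan per message.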
On Mon, Jan 16, 2017 at 11:32 AM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> I like the idea of MapDB, since we can essentially pull an instance into
> each supervisor, so it makes a lot of sense for relatively small scale,
> relatively static enrichments in general.
>
> Generally this feels like a caching problem, and would be for a simple
> key-value lookup. In that case I would agree with David Lyle on using HBase
> as a source of truth and relying on caching.
>
> That said, GeoIP is a different lookup pattern, since it's a range lookup
> then a key lookup (or, if we denormalize the MaxMind data, just a range
> lookup). For that kind of thing, MapDB with something like the BTree seems
> a good fit.
>
> Simon
>
>
> On 16 Jan 2017, at 16:28, David Lyle <dlyle65...@gmail.com> wrote:
>
>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an
>> HBase enrichment. If our current caching isn't enough to mitigate the
>> above issues, we have a problem, don't we? Or do we not recommend HBase
>> enrichment for per-message enrichment in general?
>>
>> Also - can you elaborate on how MapDB would not require a network hop?
>> Doesn't this mean we would have to sync the enrichment data to each Storm
>> supervisor? HDFS could (probably would) have a network hop too, no?
>>
>> Fwiw -
>> "In its place, I've looked at using MapDB, which is a really easy-to-use
>> library for creating Java collections backed by a file (This is NOT a
>> separate installation of anything, it's just a jar that manages
>> interaction with the file system). Given the slow churn of the GeoIP
>> files (I believe they get updated once a week), we can have a script that
>> can be run when needed, downloads the MaxMind tar file, builds the MapDB
>> file that will be used by the bolts, and places it into HDFS. Finally, we
>> update a config to point to the new file, the bolts get the updated
>> config callback and can update their db files.
>> Inside the code, we wrap the MapDB portions to make it transparent to
>> downstream code."
>>
>> Seems a bit more complex than "refresh the hbase table". Afaik, either
>> approach would require some sort of translation between the GeoIP source
>> format and the target format, so that part is a wash imo.
>>
>> So, I'd really like to see, at least, an attempt to leverage HBase
>> enrichment.
>>
>> -D...
>>
>>
>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <ceste...@gmail.com>
>> wrote:
>>
>>> I think that it's a sensible thing to use MapDB for the geo enrichment.
>>> Let me state my reasoning:
>>>
>>> - An HBase implementation would necessitate an HBase scan possibly
>>>   hitting HDFS, which is expensive per-message.
>>> - An HBase implementation would necessitate a network hop, and MapDB
>>>   would not.
>>>
>>> I also think this might be the beginning of more general-purpose support
>>> in Stellar for locally shipped, read-only MapDB lookups, which might be
>>> interesting.
>>>
>>> In short, all quotes about premature optimization are sure to apply to
>>> my reasoning, but I can't help but have my spidey senses tingle when we
>>> introduce a scan-per-message architecture.
>>>
>>> Casey
>>>
>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <dima.koval...@sstech.us>
>>> wrote:
>>>
>>>> Hello Justin,
>>>>
>>>> Considering that Metron uses HBase tables for storing enrichment and
>>>> threat intel feeds, can we use HBase for geo enrichment as well?
>>>> Or can MapDB be used for enrichment and threat intel feeds instead of
>>>> HBase?
>>>>
>>>> - Dima
>>>>
>>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
>>>>> Hi all,
>>>>>
>>>>> As a bit of background, right now, GeoIP data is loaded into and
>>>>> managed by MySQL (the connectors are LGPL licensed and we need to
>>>>> sever our Maven dependency on it before the next release).
>>>>> We currently depend on and install an instance of MySQL (in each of
>>>>> the Management Pack, Ansible, and Docker installs). In the topology,
>>>>> we use the JDBCAdapter to connect to MySQL and query for a given IP.
>>>>> Additionally, it's a single point of failure for that particular
>>>>> enrichment right now. If MySQL is down, geo enrichment can't occur.
>>>>>
>>>>> I'm proposing that we eliminate the use of MySQL entirely, through all
>>>>> installation paths (which, unless I missed some, includes Ansible, the
>>>>> Ambari Management Pack, and Docker). We'd do this by dropping all the
>>>>> various MySQL setup and management through the code, along with all
>>>>> the DDL, etc. The JDBCAdapter would stay, so that anybody who wants to
>>>>> set up their own databases for enrichments and install connectors is
>>>>> able to do so.
>>>>>
>>>>> In its place, I've looked at using MapDB, which is a really
>>>>> easy-to-use library for creating Java collections backed by a file
>>>>> (This is NOT a separate installation of anything, it's just a jar that
>>>>> manages interaction with the file system). Given the slow churn of the
>>>>> GeoIP files (I believe they get updated once a week), we can have a
>>>>> script that can be run when needed, downloads the MaxMind tar file,
>>>>> builds the MapDB file that will be used by the bolts, and places it
>>>>> into HDFS. Finally, we update a config to point to the new file, the
>>>>> bolts get the updated config callback and can update their db files.
>>>>> Inside the code, we wrap the MapDB portions to make it transparent to
>>>>> downstream code.
>>>>>
>>>>> The particularly nice parts about using MapDB are its ease of use and
>>>>> the fact that it offers, out of the box, the utilities we need to
>>>>> support the operations required here (keep in mind the GeoIP files use
>>>>> IP ranges, and we need to be able to easily grab the appropriate
>>>>> range).
>>>>>
>>>>> The main point of concern I have about this is that when we grab the
>>>>> HDFS file during an update, given that multiple JVMs can be running,
>>>>> we don't want them to clobber each other. I believe this can be
>>>>> avoided by simply using each worker's working directory to store the
>>>>> file (and appropriately ensuring threads on the same JVM manage
>>>>> multithreading). This should keep the JVMs (and the underlying DB
>>>>> files) entirely independent.
>>>>>
>>>>> This script would get called by the various installations during
>>>>> startup to do the initial setup. After install, it can then be called
>>>>> on demand.
>>>>>
>>>>> At this point, we should be all set, with everything running and
>>>>> updatable.
>>>>>
>>>>> Justin
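To make the per-worker update in Justin's proposal concrete, here is a minimal sketch of the copy-then-swap step, assuming the HDFS fetch has already landed the new file where the worker can read it (class and method names are illustrative, not Metron's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the per-worker refresh flow: on the config-update callback, each
// worker copies the new db file into its own directory and atomically swaps
// the handle readers use, so separate JVMs never clobber each other's files.
public class GeoDbRefresher {
    private final Path workDir;  // this worker's private directory
    private final AtomicReference<Path> currentDb = new AtomicReference<>();

    public GeoDbRefresher(Path workDir) {
        this.workDir = workDir;
    }

    // Invoked when the config is updated to point at a new db file.
    public void refresh(Path newDbFile) throws IOException {
        // Unique name per refresh, so readers of the old file are undisturbed.
        Path local = workDir.resolve("geo-" + System.nanoTime() + ".db");
        Files.copy(newDbFile, local, StandardCopyOption.REPLACE_EXISTING);
        Path old = currentDb.getAndSet(local); // readers see old or new, never partial
        if (old != null) {
            Files.deleteIfExists(old);
        }
    }

    public Path current() {
        return currentDb.get();
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for the file fetched from HDFS.
        Path remote = Files.createTempFile("maxmind", ".db");
        Files.write(remote, new byte[]{1, 2, 3});
        // A temp directory stands in for the worker's working directory.
        GeoDbRefresher refresher =
            new GeoDbRefresher(Files.createTempDirectory("worker"));
        refresher.refresh(remote);
        System.out.println(Files.size(refresher.current())); // prints 3
    }
}
```

The atomic-reference swap also covers the in-JVM multithreading point: bolt threads only ever see a fully written file, old or new.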