+1 Agree with David's order -Kyle
On Mon, Jan 16, 2017 at 12:41 PM, David Lyle <dlyle65...@gmail.com> wrote:

> Def agree on the parity point.
>
> I'm a little worried about Supervisor relocations for non-HBase solutions, but having much of the work done for us by MaxMind changes my preference to (in order)
>
> 1) MM API
> 2) HBase Enrichment
> 3) MapDB should the others prove not feasible
>
> -D...
>
> On Mon, Jan 16, 2017 at 12:15 PM, Justin Leet <justinjl...@gmail.com> wrote:
>
> > I definitely agree on checking out the MaxMind API. I'll take a look at it, but at first glance it looks like it does include everything we use. Great find, JJ.
> >
> > More details on various people's points:
> >
> > As a note to anyone hopping in, Simon's point on the range lookup vs a key lookup is why it becomes a Scan in HBase vs a Get. As an addendum to what Simon mentioned, denormalizing is easy enough and turns it into an easy range lookup.
> >
> > To David's point, the MapDB does require a network hop, but it's once per refresh of the data (Got a relevant callback? Grab new data, load it, swap out) instead of (up to) once per message. I would expect the same to be true of the MaxMind db files.
> >
> > I'd also argue MapDB is not really more complex than refreshing the HBase table, because we potentially have to start worrying about things like hashing and/or indices and even just general data representation. It's definitely correct that the file processing has to occur on either path, so it really boils down to handling the callback and reloading the file vs handling some of the standard HBasey things. I don't think either is an enormous amount of work (and both are almost certainly more work than MaxMind's API)
> >
> > Regarding extensibility, I'd argue for parity with what we have first, then build what we need from there. Does anybody have any disagreement with that approach for right now?
> >
> > Justin
> >
> > On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <dlyle65...@gmail.com> wrote:
> >
> > > It is interesting - it would save us a ton of effort, and has the right license. I think it's worth at least checking out.
> > >
> > > -D...
> > >
> > > On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <si...@simonellistonball.com> wrote:
> > >
> > > > I like that approach even more. That way we would only have to worry about distributing the database file in binary format to all the supervisor nodes on update.
> > > >
> > > > It would also make it easier for people to switch to the enterprise DB potentially if they had the license.
> > > >
> > > > One slight issue with this might be for people who wanted to extend the database. For example, organisations may want to add geo-enrichment to their own private network addresses based on modified versions of the geo database. Currently we don’t really allow this, since we hard-code ignoring private network classes into the geo enrichment adapter, but I can see a case where a global org might want to add their own ranges and locations to the data set. Does that make sense to anyone else?
> > > >
> > > > Simon
> > > >
> > > > > On 16 Jan 2017, at 16:50, JJ Meyer <jjmey...@gmail.com> wrote:
> > > > >
> > > > > Hello all,
> > > > >
> > > > > Can we leverage MaxMind's Java client (https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2) in this case? I believe it can directly read the MaxMind file. Plus I think it also has some support for caching as well.
> > > > >
> > > > > Thanks,
> > > > > JJ
> > > > >
> > > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <si...@simonellistonball.com> wrote:
> > > > >
> > > > >> I like the idea of MapDB, since we can essentially pull an instance into each supervisor, so it makes a lot of sense for relatively small scale, relatively static enrichments in general.
> > > > >>
> > > > >> Generally this feels like a caching problem, and would be for a simple key-value lookup. In that case I would agree with David Lyle on using HBase as a source of truth and relying on caching.
> > > > >>
> > > > >> That said, GeoIP is a different lookup pattern, since it’s a range lookup then a key lookup (or if we denormalize the MaxMind data, just a range lookup). For that kind of thing MapDB with something like the BTree seems a good fit.
> > > > >>
> > > > >> Simon
> > > > >>
> > > > >>> On 16 Jan 2017, at 16:28, David Lyle <dlyle65...@gmail.com> wrote:
> > > > >>>
> > > > >>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an HBase enrichment. If our current caching isn't enough to mitigate the above issues, we have a problem, don't we? Or do we not recommend HBase enrichment for per message enrichment in general?
> > > > >>>
> > > > >>> Also - can you elaborate on how MapDB would not require a network hop? Doesn't this mean we would have to sync the enrichment data to each Storm supervisor? HDFS could (probably would) have a network hop too, no?
> > > > >>>
> > > > >>> Fwiw -
> > > > >>> "In its place, I've looked at using MapDB, which is a really easy to use library for creating Java collections backed by a file (This is NOT a separate installation of anything, it's just a jar that manages interaction with the file system). Given the slow churn of the GeoIP files (I believe they get updated once a week), we can have a script that can be run when needed, downloads the MaxMind tar file, builds the MapDB file that will be used by the bolts, and places it into HDFS. Finally, we update a config to point to the new file, the bolts get the updated config callback and can update their db files. Inside the code, we wrap the MapDB portions to make it transparent to downstream code."
> > > > >>>
> > > > >>> Seems a bit more complex than "refresh the hbase table". Afaik, either approach would require some sort of translation between GeoIP source format and target format, so that part is a wash imo.
> > > > >>>
> > > > >>> So, I'd really like to see, at least, an attempt to leverage HBase enrichment.
> > > > >>>
> > > > >>> -D...
> > > > >>>
> > > > >>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <ceste...@gmail.com> wrote:
> > > > >>>
> > > > >>>> I think that it's a sensible thing to use MapDB for the geo enrichment. Let me state my reasoning:
> > > > >>>>
> > > > >>>> - An HBase implementation would necessitate an HBase scan possibly hitting HDFS, which is expensive per-message.
> > > > >>>> - An HBase implementation would necessitate a network hop and MapDB would not.
> > > > >>>>
> > > > >>>> I also think this might be the beginning of more general-purpose support in Stellar for locally shipped, read-only MapDB lookups, which might be interesting.
> > > > >>>>
> > > > >>>> In short, all quotes about premature optimization are sure to apply to my reasoning, but I can't help but have my spidey senses tingle when we introduce a scan-per-message architecture.
> > > > >>>>
> > > > >>>> Casey
> > > > >>>>
> > > > >>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <dima.koval...@sstech.us> wrote:
> > > > >>>>
> > > > >>>>> Hello Justin,
> > > > >>>>>
> > > > >>>>> Considering that Metron uses HBase tables for storing enrichment and threatintel feeds, can we use HBase for geo enrichment as well? Or can MapDB be used for enrichment and threatintel feeds instead of HBase?
> > > > >>>>>
> > > > >>>>> - Dima
> > > > >>>>>
> > > > >>>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
> > > > >>>>>> Hi all,
> > > > >>>>>>
> > > > >>>>>> As a bit of background, right now, GeoIP data is loaded into and managed by MySQL (the connectors are LGPL licensed and we need to sever our Maven dependency on it before next release). We currently depend on and install an instance of MySQL (in each of the Management Pack, Ansible, and Docker installs). In the topology, we use the JDBCAdapter to connect to MySQL and query for a given IP. Additionally, it's a single point of failure for that particular enrichment right now. If MySQL is down, geo enrichment can't occur.
> > > > >>>>>>
> > > > >>>>>> I'm proposing that we eliminate the use of MySQL entirely, through all installation paths (which, unless I missed some, includes Ansible, the Ambari Management Pack, and Docker). We'd do this by dropping all the various MySQL setup and management through the code, along with all the DDL, etc. The JDBCAdapter would stay, so that anybody who wants to set up their own databases for enrichments and install connectors is able to do so.
> > > > >>>>>>
> > > > >>>>>> In its place, I've looked at using MapDB, which is a really easy to use library for creating Java collections backed by a file (This is NOT a separate installation of anything, it's just a jar that manages interaction with the file system). Given the slow churn of the GeoIP files (I believe they get updated once a week), we can have a script that can be run when needed, downloads the MaxMind tar file, builds the MapDB file that will be used by the bolts, and places it into HDFS. Finally, we update a config to point to the new file, the bolts get the updated config callback and can update their db files. Inside the code, we wrap the MapDB portions to make it transparent to downstream code.
> > > > >>>>>>
> > > > >>>>>> The particularly nice parts about using MapDB are its ease of use, plus it offers the utilities we need out of the box to support the operations we need on this (keep in mind the GeoIP files use IP ranges and we need to be able to easily grab the appropriate range).
> > > > >>>>>>
> > > > >>>>>> The main point of concern I have about this is that when we grab the HDFS file during an update, given that multiple JVMs can be running, we don't want them to clobber each other. I believe this can be avoided by simply using each worker's working directory to store the file (and appropriately ensure threads on the same JVM manage multithreading). This should keep the JVMs (and the underlying DB files) entirely independent.
> > > > >>>>>>
> > > > >>>>>> This script would get called by the various installations during startup to do the initial setup. After install, it can then be called on demand in order.
> > > > >>>>>>
> > > > >>>>>> At this point, we should be all set, with everything running and updatable.
> > > > >>>>>>
> > > > >>>>>> Justin
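
For anyone reading this thread later, here is a rough sketch of the two ideas discussed above: the denormalized range lookup (Simon's and Justin's point) and the "grab new data, load it, swap out" refresh on a config callback. It uses only java.util.TreeMap as a stand-in for a sorted, BTree-like map. It is not Metron code and not the MapDB or MaxMind API. The class names GeoRange and GeoLookupSketch, the reload() method, and the sample rows are all hypothetical and exist only for illustration. I believe MapDB's BTreeMap offers similar NavigableMap-style calls, but treat that as an assumption to verify.

```java
import java.util.Arrays;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

/** One denormalized row: an inclusive IPv4 range plus its location (hypothetical shape). */
final class GeoRange {
    final long start;
    final long end;
    final String location;

    GeoRange(long start, long end, String location) {
        this.start = start;
        this.end = end;
        this.location = location;
    }
}

/** Sketch only: sorted-map range lookup with a whole-index swap on refresh. */
final class GeoLookupSketch {
    // Current index keyed by range start; replaced wholesale after each refresh.
    private final AtomicReference<NavigableMap<Long, GeoRange>> index =
            new AtomicReference<>(new TreeMap<>());

    /** Converts a dotted-quad IPv4 string to an unsigned 32-bit value held in a long. */
    static long toLong(String ipv4) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("not an IPv4 address: " + ipv4);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("bad octet in: " + ipv4);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Stand-in for the config callback: build a fresh index, then swap it in. */
    void reload(Iterable<GeoRange> ranges) {
        NavigableMap<Long, GeoRange> fresh = new TreeMap<>();
        for (GeoRange range : ranges) {
            fresh.put(range.start, range);
        }
        index.set(fresh); // readers see either the old index or the new one, never a mix
    }

    /** Range lookup: find the greatest range start <= ip, then check the range end. */
    String lookup(String ipv4) {
        long ip = toLong(ipv4);
        Map.Entry<Long, GeoRange> candidate = index.get().floorEntry(ip);
        if (candidate == null || ip > candidate.getValue().end) {
            return null; // address falls before the first range or in a gap
        }
        return candidate.getValue().location;
    }

    public static void main(String[] args) {
        GeoLookupSketch geo = new GeoLookupSketch();
        // Made-up sample rows, not real GeoIP data.
        geo.reload(Arrays.asList(
                new GeoRange(toLong("10.0.0.0"), toLong("10.0.0.255"), "loc-A"),
                new GeoRange(toLong("10.0.2.0"), toLong("10.0.2.255"), "loc-B")));
        System.out.println(geo.lookup("10.0.2.17")); // inside the second range
        System.out.println(geo.lookup("10.0.1.1"));  // in the gap between ranges
    }
}
```

The point of the sketch is that the per-message cost is a single in-memory floorEntry call, while the network hop happens only when reload() runs after a data refresh. Whether the MaxMind client makes this unnecessary is exactly what the thread proposes to check first.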