+1 to using the Java API with the MMDB file provided by MaxMind. This is what I thought we were doing when we discussed this a few months back. I'd rather use the MaxMind tools as provided instead of engineering something on top of them.
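The quoted thread below keeps returning to one point: geo enrichment is a range lookup (an IP falls between a range start and a range end), not an exact-key lookup. A minimal sketch of that access pattern in plain Java follows. It is not MaxMind's client or Metron's code; the class `GeoRangeTable`, its methods, and the sample ranges and location labels are hypothetical. It assumes IPv4 only and non-overlapping ranges.

```java
import java.util.Map;
import java.util.Optional;
import java.util.TreeMap;

/** Hypothetical illustration of a range lookup; not MaxMind's or Metron's API. */
public class GeoRangeTable {

    /** One denormalized row: the inclusive end of the range and its location label. */
    private static final class Range {
        final long end;
        final String location;

        Range(long end, String location) {
            this.end = end;
            this.location = location;
        }
    }

    // Keyed by range start, so floorEntry() finds the only range that could contain an IP.
    private final TreeMap<Long, Range> rangesByStart = new TreeMap<>();

    /** Converts a dotted-quad IPv4 literal to its unsigned 32-bit value, held in a long. */
    public static long toLong(String ipv4) {
        String[] parts = ipv4.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Expected dotted-quad IPv4: " + ipv4);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Octet out of range in: " + ipv4);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Registers an inclusive range; ranges are assumed not to overlap. */
    public void addRange(String startIp, String endIp, String location) {
        rangesByStart.put(toLong(startIp), new Range(toLong(endIp), location));
    }

    /** Finds the greatest range start at or below the IP, then checks the range end. */
    public Optional<String> lookup(String ipv4) {
        long ip = toLong(ipv4);
        Map.Entry<Long, Range> candidate = rangesByStart.floorEntry(ip);
        if (candidate == null || ip > candidate.getValue().end) {
            return Optional.empty();
        }
        return Optional.of(candidate.getValue().location);
    }

    public static void main(String[] args) {
        GeoRangeTable table = new GeoRangeTable();
        table.addRange("10.0.0.0", "10.0.0.255", "site-a");
        table.addRange("10.0.2.0", "10.0.2.255", "site-b");
        System.out.println(table.lookup("10.0.0.42"));
        System.out.println(table.lookup("10.0.1.7"));
    }
}
```

The floor lookup on a sorted map is what a key-value `Get` cannot express directly, which is why the same query becomes a scan in a store keyed by exact match.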
On Mon, Jan 16, 2017 at 3:59 PM, JJ Meyer <jjmey...@gmail.com> wrote:

> Matt, I agree with your points on why we shouldn't get rid of the database just to get rid of a database. But IMO, we may be reinventing the wheel a little by even putting the MaxMind data into MySQL. Right now we are already downloading a MaxMind file. To me it seems simpler to push the file to HDFS, where we can pick it up and have the MaxMind client use it, instead of importing data into a DB and then running a query. Also, I believe the data gets updated weekly, so syncing may become easier too.
>
> James, I believe it works with the paid and free versions of GeoIP. I know NiFi uses this client library in their Geo enrichment processor.
>
> Also, if it is decided that using a SQL database is still the best solution, I think there is a benefit to using their library. We would just have to implement a `DatabaseProvider` that hits some SQL db instead of using their standard implementation.
>
> Thanks,
> JJ
>
> On Mon, Jan 16, 2017 at 2:27 PM, James Sirota <jsir...@apache.org> wrote:
>
> > Hi Guys, I just wanted to clarify one point that I think is lost in this thread. Geo enrichment is NOT a key-value enrichment. It requires a range scan and a join (which is why it's implemented via MySQL and not HBase). To account for this access pattern via a key-value store you would inevitably have to do something funky, and in the case of HBase I don't think there is a way to avoid doing a range scan.
> >
> > With respect to MapDB, it only has support for Maps, Sets, Lists, and Queues. Are we sure it provides enough functionality for us to do this enrichment?
> >
> > With respect to the MaxMind client, are we sure we can use it on the MySQL-backed version of their DB? I thought the MaxMind database itself is proprietary and is something you have to pay for. My understanding is that the client is designed for that proprietary version.
> >
> > I somewhat agree with Matt's point. If MySQL is a problem because of licensing, the path of least resistance to remove MySQL dependencies would be to simply switch to PostgreSQL. We will always have conventional SQL databases in our stack because other big data tools use them. Why not take advantage of them too?
> >
> > Thanks,
> > James
> >
> > 16.01.2017, 12:27, "Matt Foley" <ma...@apache.org>:
> > > Hi Justin, and team,
> > > Several components of the Hadoop Stack utilize a SQL database, usually for metadata of some sort. Ambari knows this and arranges for them to share a single database installation (on or off the cluster), unless they explicitly configure use of different databases (which is allowed for sites that desire it). Ambari defaults to using PostgreSQL, although it’s happy to use MySQL, Oracle, or Microsoft, along with whatever each component historically defined as their default (such as Derby).
> > >
> > > If we want to start with a replacement of current functionality, I would suggest switching the default database to PostgreSQL. Replacing fast, efficient, and proven db services with a file-based API library (but no standard way to propagate the underlying storage files) seems to me to be taking a step backwards.
> > >
> > > Sticking with a SQL-based service will surely minimize the amount of code changes needed. And making the SQL either dialect-independent or capable of switching among dialects then enables us to do what the rest of the Hadoop stack does: allow enterprise customers to substitute Oracle or Microsoft enterprise-class databases where they wish. Regarding the drivers, we should study what the other Stack components do; I’m not an expert in those areas.
> > >
> > > Using the same db as the rest of the stack also means administrators can be confident they’ve set up adequate backup and recovery processes.
> > >
> > > All these are valuable reasons not to roll our own storage system for this enrichment data. IMO, of course.
> > >
> > > Cheers,
> > > --Matt
> > >
> > > On 1/16/17, 9:52 AM, "Kyle Richardson" <kylerichards...@gmail.com> wrote:
> > >
> > >     +1 Agree with David's order
> > >
> > >     -Kyle
> > >
> > >     On Mon, Jan 16, 2017 at 12:41 PM, David Lyle <dlyle65...@gmail.com> wrote:
> > >
> > > > Def agree on the parity point.
> > > >
> > > > I'm a little worried about Supervisor relocations for non-HBase solutions, but having much of the work done for us by MaxMind changes my preference to (in order)
> > > >
> > > > 1) MM API
> > > > 2) HBase Enrichment
> > > > 3) MapDB should the others prove not feasible
> > > >
> > > > -D...
> > > >
> > > > On Mon, Jan 16, 2017 at 12:15 PM, Justin Leet <justinjl...@gmail.com> wrote:
> > > >
> > > > > I definitely agree on checking out the MaxMind API. I'll take a look at it, but at first glance it looks like it does include everything we use. Great find, JJ.
> > > > >
> > > > > More details on various people's points:
> > > > >
> > > > > As a note to anyone hopping in, Simon's point on the range lookup vs a key lookup is why it becomes a Scan in HBase vs a Get. As an addendum to what Simon mentioned, denormalizing is easy enough and turns it into an easy range lookup.
> > > > >
> > > > > To David's point, the MapDB does require a network hop, but it's once per refresh of the data (Got a relevant callback? Grab new data, load it, swap out) instead of (up to) once per message. I would expect the same to be true of the MaxMind db files.
> > > > >
> > > > > I'd also argue MapDB is not really more complex than refreshing the HBase table, because we potentially have to start worrying about things like hashing and/or indices and even just general data representation. It's definitely correct that the file processing has to occur on either path, so it really boils down to handling the callback and reloading the file vs handling some of the standard HBasey things. I don't think either is an enormous amount of work (and both are almost certainly more work than MaxMind's API).
> > > > >
> > > > > Regarding extensibility, I'd argue for parity with what we have first, then build what we need from there. Does anybody have any disagreement with that approach for right now?
> > > > >
> > > > > Justin
> > > > >
> > > > > On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <dlyle65...@gmail.com> wrote:
> > > > >
> > > > > > It is interesting - it would save us a ton of effort and has the right license. I think it's worth at least checking out.
> > > > > >
> > > > > > -D...
> > > > > >
> > > > > > On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <si...@simonellistonball.com> wrote:
> > > > > >
> > > > > > > I like that approach even more. That way we would only have to worry about distributing the database file in binary format to all the supervisor nodes on update.
> > > > > > >
> > > > > > > It would also make it easier for people to switch to the enterprise DB potentially if they had the license.
> > > > > > >
> > > > > > > One slight issue with this might be for people who wanted to extend the database.
> > > > > > > For example, organisations may want to add geo-enrichment to their own private network addresses based on modified versions of the geo database. Currently we don’t really allow this, since we hard-code ignoring private network classes into the geo enrichment adapter, but I can see a case where a global org might want to add their own ranges and locations to the data set. Does that make sense to anyone else?
> > > > > > >
> > > > > > > Simon
> > > > > > >
> > > > > > > > On 16 Jan 2017, at 16:50, JJ Meyer <jjmey...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > Hello all,
> > > > > > > >
> > > > > > > > Can we leverage MaxMind's Java client (https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2) in this case? I believe it can directly read the MaxMind file. Plus I think it also has some support for caching.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > > JJ
> > > > > > > >
> > > > > > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <si...@simonellistonball.com> wrote:
> > > > > > > >
> > > > > > > >> I like the idea of MapDB, since we can essentially pull an instance into each supervisor, so it makes a lot of sense for relatively small scale, relatively static enrichments in general.
> > > > > > > >>
> > > > > > > >> Generally this feels like a caching problem, and would be for a simple key-value lookup. In that case I would agree with David Lyle on using HBase as a source of truth and relying on caching.
> > > > > > > >>
> > > > > > > >> That said, GeoIP is a different lookup pattern, since it’s a range lookup then a key lookup (or, if we denormalize the MaxMind data, just a range lookup). For that kind of thing, MapDB with something like the BTree seems a good fit.
> > > > > > > >>
> > > > > > > >> Simon
> > > > > > > >>
> > > > > > > >>> On 16 Jan 2017, at 16:28, David Lyle <dlyle65...@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > >>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an HBase enrichment. If our current caching isn't enough to mitigate the above issues, we have a problem, don't we? Or do we not recommend HBase enrichment for per message enrichment in general?
> > > > > > > >>>
> > > > > > > >>> Also - can you elaborate on how MapDB would not require a network hop? Doesn't this mean we would have to sync the enrichment data to each Storm supervisor? HDFS could (probably would) have a network hop too, no?
> > > > > > > >>>
> > > > > > > >>> Fwiw -
> > > > > > > >>> "In its place, I've looked at using MapDB, which is a really easy to use library for creating Java collections backed by a file (This is NOT a separate installation of anything, it's just a jar that manages interaction with the file system).
> > > > > > > >>> Given the slow churn of the GeoIP files (I believe they get updated once a week), we can have a script that can be run when needed, downloads the MaxMind tar file, builds the MapDB file that will be used by the bolts, and places it into HDFS. Finally, we update a config to point to the new file, the bolts get the updated config callback and can update their db files. Inside the code, we wrap the MapDB portions to make it transparent to downstream code."
> > > > > > > >>>
> > > > > > > >>> Seems a bit more complex than "refresh the hbase table". Afaik, either approach would require some sort of translation between GeoIP source format and target format, so that part is a wash imo.
> > > > > > > >>>
> > > > > > > >>> So, I'd really like to see, at least, an attempt to leverage HBase enrichment.
> > > > > > > >>>
> > > > > > > >>> -D...
> > > > > > > >>>
> > > > > > > >>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <ceste...@gmail.com> wrote:
> > > > > > > >>>
> > > > > > > >>>> I think that it's a sensible thing to use MapDB for the geo enrichment. Let me state my reasoning:
> > > > > > > >>>>
> > > > > > > >>>> - An HBase implementation would necessitate a HBase scan possibly hitting HDFS, which is expensive per-message.
> > > > > > > >>>> - An HBase implementation would necessitate a network hop and MapDB would not.
> > > > > > > >>>>
> > > > > > > >>>> I also think this might be the beginning of a more general purpose support in Stellar for locally shipped, read-only MapDB lookups, which might be interesting.
> > > > > > > >>>>
> > > > > > > >>>> In short, all quotes about premature optimization are sure to apply to my reasoning, but I can't help but have my spidey senses tingle when we introduce a scan-per-message architecture.
> > > > > > > >>>>
> > > > > > > >>>> Casey
> > > > > > > >>>>
> > > > > > > >>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <dima.koval...@sstech.us> wrote:
> > > > > > > >>>>
> > > > > > > >>>>> Hello Justin,
> > > > > > > >>>>>
> > > > > > > >>>>> Considering that Metron uses HBase tables for storing enrichment and threatintel feeds, can we use HBase for geo enrichment as well? Or can MapDB be used for enrichment and threatintel feeds instead of HBase?
> > > > > > > >>>>>
> > > > > > > >>>>> - Dima
> > > > > > > >>>>>
> > > > > > > >>>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
> > > > > > > >>>>>> Hi all,
> > > > > > > >>>>>>
> > > > > > > >>>>>> As a bit of background, right now, GeoIP data is loaded into and managed by MySQL (the connectors are LGPL licensed and we need to sever our Maven dependency on it before next release). We currently depend on and install an instance of MySQL (in each of the Management Pack, Ansible, and Docker installs).
> > > > > > > >>>>>> In the topology, we use the JDBCAdapter to connect to MySQL and query for a given IP. Additionally, it's a single point of failure for that particular enrichment right now. If MySQL is down, geo enrichment can't occur.
> > > > > > > >>>>>>
> > > > > > > >>>>>> I'm proposing that we eliminate the use of MySQL entirely, through all installation paths (which, unless I missed some, includes Ansible, the Ambari Management Pack, and Docker). We'd do this by dropping all the various MySQL setup and management through the code, along with all the DDL, etc. The JDBCAdapter would stay, so that anybody who wants to set up their own databases for enrichments and install connectors is able to do so.
> > > > > > > >>>>>>
> > > > > > > >>>>>> In its place, I've looked at using MapDB, which is a really easy to use library for creating Java collections backed by a file (This is NOT a separate installation of anything, it's just a jar that manages interaction with the file system).
> > > > > > > >>>>>> Given the slow churn of the GeoIP files (I believe they get updated once a week), we can have a script that can be run when needed, downloads the MaxMind tar file, builds the MapDB file that will be used by the bolts, and places it into HDFS. Finally, we update a config to point to the new file, the bolts get the updated config callback and can update their db files. Inside the code, we wrap the MapDB portions to make it transparent to downstream code.
> > > > > > > >>>>>>
> > > > > > > >>>>>> The particularly nice parts about using MapDB are its ease of use, plus it offers the utilities we need out of the box to be able to support the operations we need on this (Keep in mind the GeoIP files use IP ranges and we need to be able to easily grab the appropriate range).
> > > > > > > >>>>>>
> > > > > > > >>>>>> The main point of concern I have about this is that when we grab the HDFS file during an update, given that multiple JVMs can be running, we don't want them to clobber each other.
> > > > > > > >>>>>> I believe this can be avoided by simply using each worker's working directory to store the file (and appropriately ensure threads on the same JVM manage multithreading). This should keep the JVMs (and the underlying DB files) entirely independent.
> > > > > > > >>>>>>
> > > > > > > >>>>>> This script would get called by the various installations during startup to do the initial setup. After install, it can then be called on demand in order.
> > > > > > > >>>>>>
> > > > > > > >>>>>> At this point, we should be all set, with everything running and updatable.
> > > > > > > >>>>>>
> > > > > > > >>>>>> Justin
> >
> > -------------------
> > Thank you,
> >
> > James Sirota
> > PPMC- Apache Metron (Incubating)
> > jsirota AT apache DOT org

--
Nick Allen <n...@nickallen.org>