Re: [DISCUSS] Moving GeoIP management away from MySQL

Nick Allen Mon, 16 Jan 2017 13:12:01 -0800

+1 to using the Java API with the MMDB file provided by Maxmind.  This is
what I had thought we were doing when we discussed this a few months back.
I'd rather use the Maxmind tools as-provided instead of engineering
something on top of it.
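
For anyone skimming the quoted thread below: the access pattern at issue is a
range lookup (find the IP range containing an address) rather than an
exact-key Get. Here is a minimal sketch of that pattern using only the JDK's
TreeMap. The class name, method names, and sample ranges are hypothetical and
only for illustration. This is not Metron code, it assumes non-overlapping
IPv4 ranges, and it makes no claim about how the MaxMind reader works
internally.

```java
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical illustration only: IPv4 range lookup backed by a sorted map. */
final class GeoRangeLookup {

    /** One range entry: the inclusive end address plus a location label. */
    private static final class Range {
        final long end;
        final String location;

        Range(long end, String location) {
            this.end = end;
            this.location = location;
        }
    }

    // Keyed by the inclusive start address of each range.
    // Assumption: ranges never overlap.
    private final TreeMap<Long, Range> ranges = new TreeMap<>();

    /** Converts a dotted-quad IPv4 string into an unsigned 32-bit value held in a long. */
    static long toLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Bad octet in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Registers an inclusive [start, end] range with its location label. */
    void addRange(String start, String end, String location) {
        ranges.put(toLong(start), new Range(toLong(end), location));
    }

    /** Returns the location whose range contains the address, or null if none does. */
    String find(String ip) {
        long key = toLong(ip);
        // floorEntry gives the range with the greatest start <= key;
        // the address matches only if it also falls at or before that range's end.
        Map.Entry<Long, Range> candidate = ranges.floorEntry(key);
        if (candidate != null && key <= candidate.getValue().end) {
            return candidate.getValue().location;
        }
        return null;
    }
}
```

The point of the sketch is that a sorted structure answers the question with
one floor lookup plus a bounds check, which is what a plain hash-style
key-value Get cannot do.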


On Mon, Jan 16, 2017 at 3:59 PM, JJ Meyer <jjmey...@gmail.com> wrote:

> Matt, I agree with your points on why we shouldn't just get rid of the
> database just to get rid of a database. But IMO, I think we may be
> reinventing the wheel a little bit by even putting the maxmind data into
> MySQL. Right now we are already downloading a maxmind file. To me it seems
> simpler to push the file to HDFS where we can pick it up and have the
> maxmind client use that instead of importing data into a DB and then
> running a query. Also, I believe the data gets updated weekly. So syncing
> may become easier too.
>
> James, I believe it works with the paid and free versions of geoip. I know
> NiFi uses this client library in their Geo enrichment processor.
>
> Also, if it is decided that using a SQL database is still the best
> solution, I think there is a benefit to using their library. We would just
> have to implement a `DatabaseProvider` that hits some SQL db instead of
> using their standard implementation.
>
> Thanks,
> JJ
>
> On Mon, Jan 16, 2017 at 2:27 PM, James Sirota <jsir...@apache.org> wrote:
>
> > Hi Guys, I just wanted to clarify one point that I think is lost in this
> > thread.  Geo enrichment is NOT a key-value enrichment.  It requires a range
> > scan and a join (which is why it's implemented via MySQL and not HBase).
> > To account for this access pattern via a key-value store you would
> > inevitably have to do something funky or, in the case of HBase, I don't
> > think there is a way to avoid doing a range scan.
> >
> > With respect to MapDB, it only has support for Maps, Sets, Lists, and
> > Queues. Are we sure it provides enough functionality for us to do this
> > enrichment?
> >
> > With respect to the MaxMind client, are we sure we can use it on the
> > MySQL-backed version of their DB?  I thought the MaxMind database itself is
> > proprietary and is something you have to pay for.  My understanding is that
> > the client is designed for that proprietary version.
> >
> > I somewhat agree with Matt's point.  If MySQL is a problem because of
> > licensing, the path of least resistance to remove MySQL dependencies would
> > be to simply switch to PostgreSQL.  We will always have conventional SQL
> > databases in our stack because other big data tools use them. Why not take
> > advantage of them too?
> >
> > Thanks,
> > James
> >
> > 16.01.2017, 12:27, "Matt Foley" <ma...@apache.org>:
> > > Hi Justin, and team,
> > > Several components of the Hadoop Stack utilize a SQL database, usually
> > > for metadata of some sort. Ambari knows this and arranges for them to
> > > share a single database installation (on or off the cluster), unless they
> > > explicitly configure use of different databases (which is allowed for
> > > sites that desire it). Ambari defaults to using PostgreSQL, although it’s
> > > happy to use MySQL, Oracle, or Microsoft, along with whatever each
> > > component historically defined as their default (such as Derby).
> > >
> > > If we want to start with a replacement of current functionality, I would
> > > suggest switching the default database to PostgreSQL. Replacing fast,
> > > efficient, and proven db services with a file-based api library (but no
> > > standard way to propagate the underlying storage files) seems to me to be
> > > taking a step backwards.
> > >
> > > Sticking with a SQL-based service will surely minimize the amount of
> > > code changes needed. And making the SQL either dialect-independent or
> > > capable of switching among dialects then enables us to do what the rest
> > > of the Hadoop stack does: allow enterprise customers to substitute Oracle
> > > or Microsoft enterprise-class databases where they wish. Regarding the
> > > drivers, we should study what the other Stack components do; I’m not an
> > > expert in those areas.
> > >
> > > Using the same db as the rest of the stack also means administrators can
> > > be confident they’ve set up adequate backup and recovery processes.
> > > All these are valuable reasons not to roll our own storage system for
> > > this enrichment data. IMO, of course.
> > >
> > > Cheers,
> > > --Matt
> > >
> > > On 1/16/17, 9:52 AM, "Kyle Richardson" <kylerichards...@gmail.com>
> > wrote:
> > >
> > >     +1 Agree with David's order
> > >
> > >     -Kyle
> > >
> > >     On Mon, Jan 16, 2017 at 12:41 PM, David Lyle <dlyle65...@gmail.com
> >
> > wrote:
> > >
> > >     > Def agree on the parity point.
> > >     >
> > >     > I'm a little worried about Supervisor relocations for non-HBase
> > solutions,
> > >     > but having much of the work done for us by MaxMind changes my
> > preference to
> > >     > (in order)
> > >     >
> > >     > 1) MM API
> > >     > 2) HBase Enrichment
> > >     > 3) MapDB should the others prove not feasible
> > >     >
> > >     >
> > >     > -D...
> > >     >
> > >     >
> > >     > On Mon, Jan 16, 2017 at 12:15 PM, Justin Leet <
> > justinjl...@gmail.com>
> > >     > wrote:
> > >     >
> > >     > > I definitely agree on checking out the MaxMind API. I'll take a look at
> > >     > > it, but at first glance it looks like it does include everything we use.
> > >     > > Great find, JJ.
> > >     > >
> > >     > > More details on various people's points:
> > >     > >
> > >     > > As a note to anyone hopping in, Simon's point on the range lookup vs a key
> > >     > > lookup is why it becomes a Scan in HBase vs a Get. As an addendum to what
> > >     > > Simon mentioned, denormalizing is easy enough and turns it into an easy
> > >     > > range lookup.
> > >     > >
> > >     > > To David's point, the MapDB does require a network hop, but it's once per
> > >     > > refresh of the data (Got a relevant callback? Grab new data, load it, swap
> > >     > > out) instead of (up to) once per message. I would expect the same to be
> > >     > > true of the MaxMind db files.
> > >     > >
> > >     > > I'd also argue MapDB is not really more complex than refreshing the HBase
> > >     > > table, because we potentially have to start worrying about things like
> > >     > > hashing and/or indices and even just general data representation. It's
> > >     > > definitely correct that the file processing has to occur on either path, so
> > >     > > it really boils down to handling the callback and reloading the file vs
> > >     > > handling some of the standard HBasey things. I don't think either is an
> > >     > > enormous amount of work (and both are almost certainly more work than
> > >     > > MaxMind's API).
> > >     > >
> > >     > > Regarding extensibility, I'd argue for parity with what we have first, then
> > >     > > build what we need from there. Does anybody have any disagreement with
> > >     > > that approach for right now?
> > >     > >
> > >     > > Justin
> > >     > >
> > >     > > On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <
> > dlyle65...@gmail.com>
> > >     > wrote:
> > >     > >
> > >     > > > It is interesting: it would save us a ton of effort, and it has the
> > >     > > > right license. I think it's worth at least checking out.
> > >     > > >
> > >     > > > -D...
> > >     > > >
> > >     > > >
> > >     > > > On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <
> > >     > > > si...@simonellistonball.com> wrote:
> > >     > > >
> > >     > > > > I like that approach even more. That way we would only have to worry about
> > >     > > > > distributing the database file in binary format to all the supervisor nodes
> > >     > > > > on update.
> > >     > > > >
> > >     > > > > It would also make it easier for people to switch to the enterprise DB
> > >     > > > > potentially if they had the license.
> > >     > > > >
> > >     > > > > One slight issue with this might be for people who wanted to extend the
> > >     > > > > database. For example, organisations may want to add geo-enrichment to
> > >     > > > > their own private network addresses based on modified versions of the geo
> > >     > > > > database. Currently we don’t really allow this, since we hard-code ignoring
> > >     > > > > private network classes into the geo enrichment adapter, but I can see a
> > >     > > > > case where a global org might want to add their own ranges and locations to
> > >     > > > > the data set. Does that make sense to anyone else?
> > >     > > > >
> > >     > > > > Simon
> > >     > > > >
> > >     > > > >
> > >     > > > > > On 16 Jan 2017, at 16:50, JJ Meyer <jjmey...@gmail.com>
> > wrote:
> > >     > > > > >
> > >     > > > > > Hello all,
> > >     > > > > >
> > >     > > > > > Can we leverage maxmind's Java client (
> > >     > > > > > https://github.com/maxmind/GeoIP2-java/tree/master/src/
> > >     > > > > main/java/com/maxmind/geoip2)
> > >     > > > > > in this case? I believe it can directly read maxmind
> file.
> > Plus I
> > >     > > think
> > >     > > > > it
> > >     > > > > > also has some support for caching as well.
> > >     > > > > >
> > >     > > > > > Thanks,
> > >     > > > > > JJ
> > >     > > > > >
> > >     > > > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <
> > >     > > > > > si...@simonellistonball.com> wrote:
> > >     > > > > >
> > >     > > > > >> I like the idea of MapDB, since we can essentially pull
> an
> > >     > instance
> > >     > > > into
> > >     > > > > >> each supervisor, so it makes a lot of sense for
> > relatively small
> > >     > > > scale,
> > >     > > > > >> relatively static enrichments in general.
> > >     > > > > >>
> > >     > > > > >> Generally this feels like a caching problem, and would
> be
> > for a
> > >     > > simple
> > >     > > > > >> key-value lookup. In that case I would agree with David
> > Lyle on
> > >     > > using
> > >     > > > > HBase
> > >     > > > > >> as a source of truth and relying on caching.
> > >     > > > > >>
> > >     > > > > >> That said, GeoIP is a different lookup pattern, since
> > it’s a range
> > >     > > > > lookup
> > >     > > > > >> then a key lookup (or if we denormalize the MaxMind
> data,
> > just a
> > >     > > range
> > >     > > > > >> lookup) for that kind of thing MapDB with something like
> > the BTree
> > >     > > > > seems a
> > >     > > > > >> good fit.
> > >     > > > > >>
> > >     > > > > >> Simon
> > >     > > > > >>
> > >     > > > > >>
> > >     > > > > >>> On 16 Jan 2017, at 16:28, David Lyle <
> > dlyle65...@gmail.com>
> > >     > wrote:
> > >     > > > > >>>
> > >     > > > > >>> I'm +1 on removing the MySQL dependency, BUT - I'd
> > prefer to see
> > >     > it
> > >     > > > as
> > >     > > > > an
> > >     > > > > >>> HBase enrichment. If our current caching isn't enough
> to
> > mitigate
> > >     > > the
> > >     > > > > >> above
> > >     > > > > >>> issues, we have a problem, don't we? Or do we not
> > recommend HBase
> > >     > > > > >>> enrichment for per message enrichment in general?
> > >     > > > > >>>
> > >     > > > > >>> Also- can you elaborate on how MapDB would not require
> a
> > network
> > >     > > hop?
> > >     > > > > >>> Doesn't this mean we would have to sync the enrichment
> > data to
> > >     > each
> > >     > > > > Storm
> > >     > > > > >>> supervisor? HDFS could (probably would) have a network
> > hop too,
> > >     > no?
> > >     > > > > >>>
> > >     > > > > >>> Fwiw -
> > >     > > > > >>> "In its place, I've looked at using MapDB, which is a
> > really easy
> > >     > > to
> > >     > > > > use
> > >     > > > > >>> library for creating Java collections backed by a file
> > (This is
> > >     > > NOT a
> > >     > > > > >>> separate installation of anything, it's just a jar that
> > manages
> > >     > > > > >> interaction
> > >     > > > > >>> with the file system). Given the slow churn of the
> GeoIP
> > files
> > >     > (I
> > >     > > > > >> believe
> > >     > > > > >>> they get updated once a week), we can have a script
> that
> > can be
> > >     > run
> > >     > > > > when
> > >     > > > > >>> needed, downloads the MaxMind tar file, builds the
> MapDB
> > file
> > >     > that
> > >     > > > will
> > >     > > > > >> be
> > >     > > > > >>> used by the bolts, and places it into HDFS. Finally, we
> > update a
> > >     > > > > config
> > >     > > > > >> to
> > >     > > > > >>> point to the new file, the bolts get the updated config
> > callback
> > >     > > and
> > >     > > > > can
> > >     > > > > >>> update their db files. Inside the code, we wrap the
> MapDB
> > >     > portions
> > >     > > > to
> > >     > > > > >> make
> > >     > > > > >>> it transparent to downstream code."
> > >     > > > > >>>
> > >     > > > > >>> Seems a bit more complex than "refresh the hbase
> table".
> > Afaik,
> > >     > > > either
> > >     > > > > >>> approach would require some sort of translation between
> > GeoIP
> > >     > > source
> > >     > > > > >> format
> > >     > > > > >>> and target format, so that part is a wash imo.
> > >     > > > > >>>
> > >     > > > > >>> So, I'd really like to see, at least, an attempt to
> > leverage
> > >     > HBase
> > >     > > > > >>> enrichment.
> > >     > > > > >>>
> > >     > > > > >>> -D...
> > >     > > > > >>>
> > >     > > > > >>>
> > >     > > > > >>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <
> > >     > ceste...@gmail.com
> > >     > > >
> > >     > > > > >> wrote:
> > >     > > > > >>>
> > >     > > > > >>>> I think that it's a sensible thing to use MapDB for
> the
> > geo
> > >     > > > > enrichment.
> > >     > > > > >>>> Let me state my reasoning:
> > >     > > > > >>>>
> > >     > > > > >>>> - An HBase implementation would necessitate a HBase
> scan
> > >     > > possibly
> > >     > > > > >>>> hitting HDFS, which is expensive per-message.
> > >     > > > > >>>> - An HBase implementation would necessitate a network
> > hop and
> > >     > > MapDB
> > >     > > > > >>>> would not.
> > >     > > > > >>>>
> > >     > > > > >>>> I also think this might be the beginning of a more
> > general
> > >     > purpose
> > >     > > > > >> support
> > >     > > > > >>>> in Stellar for locally shipped, read-only MapDB
> > lookups, which
> > >     > > might
> > >     > > > > be
> > >     > > > > >>>> interesting.
> > >     > > > > >>>>
> > >     > > > > >>>> In short, all quotes about premature optimization are
> > sure to
> > >     > > apply
> > >     > > > to
> > >     > > > > >> my
> > >     > > > > >>>> reasoning, but I can't help but have my spidey senses
> > tingle
> > >     > when
> > >     > > we
> > >     > > > > >>>> introduce a scan-per-message architecture.
> > >     > > > > >>>>
> > >     > > > > >>>> Casey
> > >     > > > > >>>>
> > >     > > > > >>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <
> > >     > > > > >> dima.koval...@sstech.us>
> > >     > > > > >>>> wrote:
> > >     > > > > >>>>
> > >     > > > > >>>>> Hello Justin,
> > >     > > > > >>>>>
> > >     > > > > >>>>> Considering that Metron uses hbase tables for storing
> > >     > enrichment
> > >     > > > and
> > >     > > > > >>>>> threatintel feeds, can we use Hbase for geo
> enrichment
> > as well?
> > >     > > > > >>>>> Or MapDB can be used for enrichment and threatintel
> > feeds
> > >     > instead
> > >     > > > of
> > >     > > > > >>>> hbase?
> > >     > > > > >>>>>
> > >     > > > > >>>>> - Dima
> > >     > > > > >>>>>
> > >     > > > > >>>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
> > >     > > > > >>>>>> Hi all,
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> As a bit of background, right now, GeoIP data is
> > loaded into
> > >     > and
> > >     > > > > >>>> managed
> > >     > > > > >>>>> by
> > >     > > > > >>>>>> MySQL (the connectors are LGPL licensed and we need
> > to sever
> > >     > our
> > >     > > > > Maven
> > >     > > > > >>>>>> dependency on it before next release). We currently
> > depend on
> > >     > > and
> > >     > > > > >>>> install
> > >     > > > > >>>>>> an instance of MySQL (in each of the Management
> Pack,
> > Ansible,
> > >     > > and
> > >     > > > > >>>> Docker
> > >     > > > > >>>>>> installs). In the topology, we use the JDBCAdapter
> to
> > connect
> > >     > to
> > >     > > > > MySQL
> > >     > > > > >>>>> and
> > >     > > > > >>>>>> query for a given IP. Additionally, it's a single
> > point of
> > >     > > > failure
> > >     > > > > >> for
> > >     > > > > >>>>>> that particular enrichment right now. If MySQL is
> > down, geo
> > >     > > > > >> enrichment
> > >     > > > > >>>>>> can't occur.
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> I'm proposing that we eliminate the use of MySQL
> > entirely,
> > >     > > through
> > >     > > > > all
> > >     > > > > >>>>>> installation paths (which, unless I missed some,
> > includes
> > >     > > Ansible,
> > >     > > > > the
> > >     > > > > >>>>>> Ambari Management Pack, and Docker). We'd do this by
> > dropping
> > >     > > all
> > >     > > > > the
> > >     > > > > >>>>>> various MySQL setup and management through the code,
> > along
> > >     > with
> > >     > > > all
> > >     > > > > >> the
> > >     > > > > >>>>>> DDL, etc. The JDBCAdapter would stay, so that
> anybody
> > who
> > >     > wants
> > >     > > > to
> > >     > > > > >>>> setup
> > >     > > > > >>>>>> their own databases for enrichments and install
> > connectors is
> > >     > > able
> > >     > > > > to
> > >     > > > > >>>> do
> > >     > > > > >>>>> so.
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> In its place, I've looked at using MapDB, which is a
> > really
> > >     > easy
> > >     > > > to
> > >     > > > > >> use
> > >     > > > > >>>>>> library for creating Java collections backed by a
> > file (This
> > >     > is
> > >     > > > NOT
> > >     > > > > a
> > >     > > > > >>>>>> separate installation of anything, it's just a jar
> > that
> > >     > manages
> > >     > > > > >>>>> interaction
> > >     > > > > >>>>>> with the file system). Given the slow churn of the
> > GeoIP
> > >     > files
> > >     > > (I
> > >     > > > > >>>>> believe
> > >     > > > > >>>>>> they get updated once a week), we can have a script
> > that can
> > >     > be
> > >     > > > run
> > >     > > > > >>>> when
> > >     > > > > >>>>>> needed, downloads the MaxMind tar file, builds the
> > MapDB file
> > >     > > that
> > >     > > > > >> will
> > >     > > > > >>>>> be
> > >     > > > > >>>>>> used by the bolts, and places it into HDFS. Finally,
> > we
> > >     > update
> > >     > > a
> > >     > > > > >>>> config
> > >     > > > > >>>>> to
> > >     > > > > >>>>>> point to the new file, the bolts get the updated
> > config
> > >     > callback
> > >     > > > and
> > >     > > > > >>>> can
> > >     > > > > >>>>>> update their db files. Inside the code, we wrap the
> > MapDB
> > >     > > > portions
> > >     > > > > to
> > >     > > > > >>>>> make
> > >     > > > > >>>>>> it transparent to downstream code.
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> The particularly nice parts about using MapDB are
> > that its
> > >     > ease
> > >     > > of
> > >     > > > > use
> > >     > > > > >>>>> plus
> > >     > > > > >>>>>> it offers the utilities we need out of the box to be
> > able to
> > >     > > > support
> > >     > > > > >>>> the
> > >     > > > > >>>>>> operations we need on this (Keep in mind the GeoIP
> > files use
> > >     > IP
> > >     > > > > ranges
> > >     > > > > >>>>> and
> > >     > > > > >>>>>> we need to be able to easily grab the appropriate
> > range).
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> The main point of concern I have about this is that
> > when we
> > >     > grab
> > >     > > > the
> > >     > > > > >>>> HDFS
> > >     > > > > >>>>>> file during an update, given that multiple JVMs can
> be
> > >     > running,
> > >     > > we
> > >     > > > > >>>> don't
> > >     > > > > >>>>>> want them to clobber each other. I believe this can
> > be avoided
> > >     > > by
> > >     > > > > >>>> simply
> > >     > > > > >>>>>> using each worker's working directory to store the
> > file (and
> > >     > > > > >>>>> appropriately
> > >     > > > > >>>>>> ensure threads on the same JVM manage
> > multithreading). This
> > >     > > > should
> > >     > > > > >>>> keep
> > >     > > > > >>>>>> the JVMs (and the underlying DB files) entirely
> > independent.
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> This script would get called by the various
> > installations
> > >     > during
> > >     > > > > >>>> startup
> > >     > > > > >>>>> to
> > >     > > > > >>>>>> do the initial setup. After install, it can then be
> > called on
> > >     > > > > demand
> > >     > > > > >>>> in
> > >     > > > > >>>>>> order.
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> At this point, we should be all set, with everything
> > running
> > >     > and
> > >     > > > > >>>>> updatable.
> > >     > > > > >>>>>>
> > >     > > > > >>>>>> Justin
> > >     > > > > >>>>>>
> > >     > > > > >>>>>
> > >     > > > > >>>>>
> > >     > > > > >>>>
> > >     > > > > >>
> > >     > > > > >>
> > >     > > > >
> > >     > > > >
> > >     > > >
> > >     > >
> > >     >
> >
> > -------------------
> > Thank you,
> >
> > James Sirota
> > PPMC- Apache Metron (Incubating)
> > jsirota AT apache DOT org
> >
>



-- 
Nick Allen <n...@nickallen.org>
