Re: [DISCUSS] Moving GeoIP management away from MySQL

Kyle Richardson Mon, 16 Jan 2017 09:53:16 -0800

+1 Agree with David's order

-Kyle


On Mon, Jan 16, 2017 at 12:41 PM, David Lyle <dlyle65...@gmail.com> wrote:

> Def agree on the parity point.
>
> I'm a little worried about Supervisor relocations for non-HBase solutions,
> but having much of the work done for us by MaxMind changes my preference to
> (in order)
>
> 1) MM API
> 2) HBase Enrichment
> 3) MapDB should the others prove not feasible
>
>
> -D...
>
>
> On Mon, Jan 16, 2017 at 12:15 PM, Justin Leet <justinjl...@gmail.com>
> wrote:
>
> > I definitely agree on checking out the MaxMind API.  I'll take a look at
> > it, but at first glance it looks like it does include everything we use.
> > Great find, JJ.
> >
> > More details on various people's points:
> >
> > As a note to anyone hopping in, Simon's point on the range lookup vs a key
> > lookup is why it becomes a Scan in HBase vs a Get.  As an addendum to what
> > Simon mentioned, denormalizing is easy enough and turns it into an easy
> > range lookup.
> >
> > To David's point, MapDB does require a network hop, but it's once per
> > refresh of the data (Got a relevant callback? Grab new data, load it, swap
> > out) instead of (up to) once per message.  I would expect the same to be
> > true of the MaxMind db files.
> >
> > I'd also argue MapDB is not really more complex than refreshing the HBase
> > table, because with HBase we potentially have to start worrying about things
> > like hashing and/or indices and even just general data representation. It's
> > definitely correct that the file processing has to occur on either path, so
> > it really boils down to handling the callback and reloading the file vs
> > handling some of the standard HBasey things.  I don't think either is an
> > enormous amount of work (and both are almost certainly more work than
> > MaxMind's API).
> >
> > Regarding extensibility, I'd argue for parity with what we have first, then
> > build what we need from there.  Does anybody have any disagreement with
> > that approach for right now?
> >
> > Justin
> >
> > On Mon, Jan 16, 2017 at 12:04 PM, David Lyle <dlyle65...@gmail.com> wrote:
> >
> > > It is interesting: it would save us a ton of effort, and it has the right
> > > license. I think it's worth at least checking out.
> > >
> > > -D...
> > >
> > >
> > > On Mon, Jan 16, 2017 at 12:00 PM, Simon Elliston Ball <
> > > si...@simonellistonball.com> wrote:
> > >
> > > > I like that approach even more. That way we would only have to worry about
> > > > distributing the database file in binary format to all the supervisor nodes
> > > > on update.
> > > >
> > > > It would also potentially make it easier for people to switch to the
> > > > enterprise DB if they had the license.
> > > >
> > > > One slight issue with this might be for people who wanted to extend the
> > > > database. For example, organisations may want to add geo-enrichment to
> > > > their own private network addresses based on modified versions of the geo
> > > > database. Currently we don’t really allow this, since we hard-code ignoring
> > > > private network classes into the geo enrichment adapter, but I can see a
> > > > case where a global org might want to add their own ranges and locations to
> > > > the data set. Does that make sense to anyone else?
> > > >
> > > > Simon
> > > >
> > > >
> > > > > On 16 Jan 2017, at 16:50, JJ Meyer <jjmey...@gmail.com> wrote:
> > > > >
> > > > > Hello all,
> > > > >
> > > > > Can we leverage MaxMind's Java client (
> > > > > https://github.com/maxmind/GeoIP2-java/tree/master/src/main/java/com/maxmind/geoip2)
> > > > > in this case? I believe it can directly read the MaxMind file. Plus, I think
> > > > > it also has some support for caching.
> > > > >
> > > > > Thanks,
> > > > > JJ
> > > > >
> > > > > On Mon, Jan 16, 2017 at 10:32 AM, Simon Elliston Ball <
> > > > > si...@simonellistonball.com> wrote:
> > > > >
> > > > >> I like the idea of MapDB, since we can essentially pull an instance into
> > > > >> each supervisor, so it makes a lot of sense for relatively small scale,
> > > > >> relatively static enrichments in general.
> > > > >>
> > > > >> Generally this feels like a caching problem, and would be for a simple
> > > > >> key-value lookup. In that case I would agree with David Lyle on using HBase
> > > > >> as a source of truth and relying on caching.
> > > > >>
> > > > >> That said, GeoIP is a different lookup pattern, since it’s a range lookup
> > > > >> then a key lookup (or, if we denormalize the MaxMind data, just a range
> > > > >> lookup). For that kind of thing, MapDB with something like the BTree seems a
> > > > >> good fit.
> > > > >>
> > > > >> Simon
> > > > >>
> > > > >>
> > > > >>> On 16 Jan 2017, at 16:28, David Lyle <dlyle65...@gmail.com> wrote:
> > > > >>>
> > > > >>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an
> > > > >>> HBase enrichment. If our current caching isn't enough to mitigate the above
> > > > >>> issues, we have a problem, don't we? Or do we not recommend HBase
> > > > >>> enrichment for per message enrichment in general?
> > > > >>>
> > > > >>> Also- can you elaborate on how MapDB would not require a network hop?
> > > > >>> Doesn't this mean we would have to sync the enrichment data to each Storm
> > > > >>> supervisor? HDFS could (probably would) have a network hop too, no?
> > > > >>>
> > > > >>> Fwiw -
> > > > >>> "In its place, I've looked at using MapDB, which is a really easy to use
> > > > >>> library for creating Java collections backed by a file (This is NOT a
> > > > >>> separate installation of anything, it's just a jar that manages interaction
> > > > >>> with the file system).  Given the slow churn of the GeoIP files (I believe
> > > > >>> they get updated once a week), we can have a script that can be run when
> > > > >>> needed, downloads the MaxMind tar file, builds the MapDB file that will be
> > > > >>> used by the bolts, and places it into HDFS.  Finally, we update a config to
> > > > >>> point to the new file, the bolts get the updated config callback and can
> > > > >>> update their db files.  Inside the code, we wrap the MapDB portions to make
> > > > >>> it transparent to downstream code."
> > > > >>>
> > > > >>> Seems a bit more complex than "refresh the hbase table". Afaik, either
> > > > >>> approach would require some sort of translation between GeoIP source format
> > > > >>> and target format, so that part is a wash imo.
> > > > >>>
> > > > >>> So, I'd really like to see, at least, an attempt to leverage HBase
> > > > >>> enrichment.
> > > > >>>
> > > > >>> -D...
> > > > >>>
> > > > >>>
> > > > >>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <ceste...@gmail.com> wrote:
> > > > >>>
> > > > >>>> I think that it's a sensible thing to use MapDB for the geo enrichment.
> > > > >>>> Let me state my reasoning:
> > > > >>>>
> > > > >>>>  - An HBase implementation would necessitate an HBase scan possibly
> > > > >>>>  hitting HDFS, which is expensive per-message.
> > > > >>>>  - An HBase implementation would necessitate a network hop and MapDB
> > > > >>>>  would not.
> > > > >>>>
> > > > >>>> I also think this might be the beginning of more general-purpose support
> > > > >>>> in Stellar for locally shipped, read-only MapDB lookups, which might be
> > > > >>>> interesting.
> > > > >>>>
> > > > >>>> In short, all quotes about premature optimization are sure to apply to my
> > > > >>>> reasoning, but I can't help but have my spidey senses tingle when we
> > > > >>>> introduce a scan-per-message architecture.
> > > > >>>>
> > > > >>>> Casey
> > > > >>>>
> > > > >>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <dima.koval...@sstech.us> wrote:
> > > > >>>>
> > > > >>>>> Hello Justin,
> > > > >>>>>
> > > > >>>>> Considering that Metron uses HBase tables for storing enrichment and
> > > > >>>>> threatintel feeds, can we use HBase for geo enrichment as well?
> > > > >>>>> Or can MapDB be used for enrichment and threatintel feeds instead of HBase?
> > > > >>>>>
> > > > >>>>> - Dima
> > > > >>>>>
> > > > >>>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
> > > > >>>>>> Hi all,
> > > > >>>>>>
> > > > >>>>>> As a bit of background, right now, GeoIP data is loaded into and managed by
> > > > >>>>>> MySQL (the connectors are LGPL licensed and we need to sever our Maven
> > > > >>>>>> dependency on them before the next release). We currently depend on and
> > > > >>>>>> install an instance of MySQL (in each of the Management Pack, Ansible, and
> > > > >>>>>> Docker installs). In the topology, we use the JDBCAdapter to connect to
> > > > >>>>>> MySQL and query for a given IP.  Additionally, it's a single point of
> > > > >>>>>> failure for that particular enrichment right now.  If MySQL is down, geo
> > > > >>>>>> enrichment can't occur.
> > > > >>>>>>
> > > > >>>>>> I'm proposing that we eliminate the use of MySQL entirely, through all
> > > > >>>>>> installation paths (which, unless I missed some, includes Ansible, the
> > > > >>>>>> Ambari Management Pack, and Docker).  We'd do this by dropping all the
> > > > >>>>>> various MySQL setup and management through the code, along with all the
> > > > >>>>>> DDL, etc.  The JDBCAdapter would stay, so that anybody who wants to set up
> > > > >>>>>> their own databases for enrichments and install connectors is able to do so.
> > > > >>>>>>
> > > > >>>>>> In its place, I've looked at using MapDB, which is a really easy to use
> > > > >>>>>> library for creating Java collections backed by a file (This is NOT a
> > > > >>>>>> separate installation of anything, it's just a jar that manages interaction
> > > > >>>>>> with the file system).  Given the slow churn of the GeoIP files (I believe
> > > > >>>>>> they get updated once a week), we can have a script that can be run when
> > > > >>>>>> needed, downloads the MaxMind tar file, builds the MapDB file that will be
> > > > >>>>>> used by the bolts, and places it into HDFS.  Finally, we update a config to
> > > > >>>>>> point to the new file, the bolts get the updated config callback and can
> > > > >>>>>> update their db files.  Inside the code, we wrap the MapDB portions to make
> > > > >>>>>> it transparent to downstream code.
> > > > >>>>>>
> > > > >>>>>> The particularly nice parts about using MapDB are its ease of use, plus
> > > > >>>>>> it offers the utilities we need out of the box to be able to support the
> > > > >>>>>> operations we need on this (Keep in mind the GeoIP files use IP ranges and
> > > > >>>>>> we need to be able to easily grab the appropriate range).
> > > > >>>>>>
> > > > >>>>>> The main point of concern I have about this is that when we grab the HDFS
> > > > >>>>>> file during an update, given that multiple JVMs can be running, we don't
> > > > >>>>>> want them to clobber each other. I believe this can be avoided by simply
> > > > >>>>>> using each worker's working directory to store the file (and appropriately
> > > > >>>>>> ensuring threads on the same JVM manage multithreading).  This should keep
> > > > >>>>>> the JVMs (and the underlying DB files) entirely independent.
> > > > >>>>>>
> > > > >>>>>> This script would get called by the various installations during startup to
> > > > >>>>>> do the initial setup.  After install, it can then be called on demand in
> > > > >>>>>> order.
> > > > >>>>>>
> > > > >>>>>> At this point, we should be all set, with everything running and updatable.
> > > > >>>>>>
> > > > >>>>>> Justin
> > > > >>>>>>
> > > > >>>>>
> > > > >>>>>
> > > > >>>>
> > > > >>
> > > > >>
> > > >
> > > >
> > >
> >
>
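
The range lookup discussed in the thread (denormalized GeoIP rows keyed by the start of each IP range, with the whole table swapped out when a refresh callback arrives) can be sketched in plain Java. This is only a rough illustration under stated assumptions, not Metron's, MapDB's, or MaxMind's actual code: the class name GeoRangeLookupSketch, the Range holder, and the refresh/lookup methods are all hypothetical, and a java.util.TreeMap stands in for whatever sorted structure would really back the data. It handles IPv4 only.

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;
import java.util.concurrent.atomic.AtomicReference;

/** Hypothetical sketch of a denormalized IP-range lookup; all names are illustrative. */
final class GeoRangeLookupSketch {

    /** One denormalized row: the inclusive end of the range plus its location. */
    static final class Range {
        final long end;         // inclusive end of the IP range
        final String location;  // stand-in for the full geo record

        Range(long end, String location) {
            this.end = end;
            this.location = location;
        }
    }

    // Current table keyed by range start. It is replaced wholesale on refresh,
    // so readers always see either the old table or the new one.
    private final AtomicReference<NavigableMap<Long, Range>> table =
            new AtomicReference<NavigableMap<Long, Range>>(new TreeMap<Long, Range>());

    /** Converts a dotted-quad IPv4 address to an unsigned 32-bit value held in a long. */
    static long ipv4ToLong(String ip) {
        String[] parts = ip.split("\\.");
        if (parts.length != 4) {
            throw new IllegalArgumentException("Not an IPv4 address: " + ip);
        }
        long value = 0;
        for (String part : parts) {
            int octet = Integer.parseInt(part);
            if (octet < 0 || octet > 255) {
                throw new IllegalArgumentException("Bad octet in: " + ip);
            }
            value = (value << 8) | octet;
        }
        return value;
    }

    /** Assumed to be called from a config-update callback once new data is loaded. */
    void refresh(NavigableMap<Long, Range> freshlyLoaded) {
        table.set(freshlyLoaded);
    }

    /** Finds the greatest range start <= ip, then checks that ip is inside that range. */
    String lookup(String ip) {
        long key = ipv4ToLong(ip);
        Map.Entry<Long, Range> candidate = table.get().floorEntry(key);
        if (candidate == null || key > candidate.getValue().end) {
            return null; // no range covers this address
        }
        return candidate.getValue().location;
    }

    public static void main(String[] args) {
        NavigableMap<Long, Range> data = new TreeMap<Long, Range>();
        data.put(ipv4ToLong("192.0.2.0"),
                 new Range(ipv4ToLong("192.0.2.255"), "Example City"));

        GeoRangeLookupSketch db = new GeoRangeLookupSketch();
        db.refresh(data);
        System.out.println(db.lookup("192.0.2.17"));   // inside the range
        System.out.println(db.lookup("198.51.100.1")); // outside every range
    }
}
```

The lookup is a single floorEntry call per message with no network hop, which is the property the MapDB and MaxMind-file options are being weighed on. As far as I know MapDB's BTreeMap exposes the same java.util navigable-map methods, but treat that as an assumption to verify rather than a statement about its API.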
