+1 to the point about this being a different lookup pattern. If this could be done in a multi-get, I'd be all for HBase, but I worry about scan performance, a historical sore point for HBase architectures.
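It's worth making concrete why a multi-get doesn't fit: GeoIP is a range-then-key lookup, not a point lookup. A minimal sketch of that pattern using a plain `NavigableMap` (MapDB's file-backed BTreeMap exposes a similar navigable-map API; the class, data, and helper below are purely illustrative, not anything in Metron):

```java
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

// Sketch of the range-then-key lookup GeoIP needs: store each block's start
// address as the key, then floorEntry() finds the candidate covering range.
// A plain in-memory TreeMap stands in here for a file-backed BTree.
public class GeoRangeLookupSketch {
    // Convert a dotted-quad IPv4 address to a long so ranges sort numerically.
    static long ipToLong(String ip) {
        long result = 0;
        for (String octet : ip.split("\\.")) {
            result = (result << 8) | Integer.parseInt(octet);
        }
        return result;
    }

    public static void main(String[] args) {
        // Keys are range starts; values carry the range end plus the geo data.
        NavigableMap<Long, String[]> ranges = new TreeMap<>();
        ranges.put(ipToLong("10.0.0.0"),    new String[]{"10.0.0.255",    "site-A"});
        ranges.put(ipToLong("192.168.1.0"), new String[]{"192.168.1.255", "site-B"});

        long ip = ipToLong("192.168.1.42");
        Map.Entry<Long, String[]> e = ranges.floorEntry(ip);
        // Confirm the candidate range actually covers the address.
        if (e != null && ip <= ipToLong(e.getValue()[0])) {
            System.out.println(e.getValue()[1]); // prints site-B
        } else {
            System.out.println("no match");
        }
    }
}
```

A multi-get needs the exact keys up front; here the key (the covering range) is only discoverable by ordered traversal, which in HBase terms means a scan per message.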
On Mon, Jan 16, 2017 at 11:32 AM, Simon Elliston Ball <
si...@simonellistonball.com> wrote:

> I like the idea of MapDB, since we can essentially pull an instance into
> each supervisor, so it makes a lot of sense for relatively small scale,
> relatively static enrichments in general.
>
> Generally this feels like a caching problem, and would be for a simple
> key-value lookup. In that case I would agree with David Lyle on using HBase
> as a source of truth and relying on caching.
>
> That said, GeoIP is a different lookup pattern, since it's a range lookup
> then a key lookup (or, if we denormalize the MaxMind data, just a range
> lookup). For that kind of thing, MapDB with something like the BTree seems
> a good fit.
>
> Simon
>
>
> On 16 Jan 2017, at 16:28, David Lyle <dlyle65...@gmail.com> wrote:
>
>> I'm +1 on removing the MySQL dependency, BUT - I'd prefer to see it as an
>> HBase enrichment. If our current caching isn't enough to mitigate the
>> above issues, we have a problem, don't we? Or do we not recommend HBase
>> enrichment for per-message enrichment in general?
>>
>> Also - can you elaborate on how MapDB would not require a network hop?
>> Doesn't this mean we would have to sync the enrichment data to each Storm
>> supervisor? HDFS could (probably would) have a network hop too, no?
>>
>> Fwiw -
>> "In its place, I've looked at using MapDB, which is a really easy-to-use
>> library for creating Java collections backed by a file (This is NOT a
>> separate installation of anything, it's just a jar that manages
>> interaction with the file system). Given the slow churn of the GeoIP
>> files (I believe they get updated once a week), we can have a script that
>> can be run when needed, downloads the MaxMind tar file, builds the MapDB
>> file that will be used by the bolts, and places it into HDFS. Finally, we
>> update a config to point to the new file, the bolts get the updated
>> config callback and can update their db files.
>> Inside the code, we wrap the MapDB portions to make it transparent to
>> downstream code."
>>
>> Seems a bit more complex than "refresh the hbase table". Afaik, either
>> approach would require some sort of translation between the GeoIP source
>> format and the target format, so that part is a wash imo.
>>
>> So, I'd really like to see, at least, an attempt to leverage HBase
>> enrichment.
>>
>> -D...
>>
>>
>> On Mon, Jan 16, 2017 at 11:02 AM, Casey Stella <ceste...@gmail.com>
>> wrote:
>>
>>> I think that it's a sensible thing to use MapDB for the geo enrichment.
>>> Let me state my reasoning:
>>>
>>> - An HBase implementation would necessitate an HBase scan possibly
>>>   hitting HDFS, which is expensive per-message.
>>> - An HBase implementation would necessitate a network hop, and MapDB
>>>   would not.
>>>
>>> I also think this might be the beginning of more general-purpose support
>>> in Stellar for locally shipped, read-only MapDB lookups, which might be
>>> interesting.
>>>
>>> In short, all quotes about premature optimization are sure to apply to
>>> my reasoning, but I can't help but have my spidey senses tingle when we
>>> introduce a scan-per-message architecture.
>>>
>>> Casey
>>>
>>> On Mon, Jan 16, 2017 at 10:53 AM, Dima Kovalyov <dima.koval...@sstech.us>
>>> wrote:
>>>
>>>> Hello Justin,
>>>>
>>>> Considering that Metron uses HBase tables for storing enrichment and
>>>> threat intel feeds, can we use HBase for geo enrichment as well?
>>>> Or can MapDB be used for enrichment and threat intel feeds instead of
>>>> HBase?
>>>>
>>>> - Dima
>>>>
>>>> On 01/16/2017 04:17 PM, Justin Leet wrote:
>>>>> Hi all,
>>>>>
>>>>> As a bit of background, right now, GeoIP data is loaded into and
>>>>> managed by MySQL (the connectors are LGPL licensed and we need to
>>>>> sever our Maven dependency on it before the next release).
>>>>> We currently depend on and install an instance of MySQL (in each of
>>>>> the Management Pack, Ansible, and Docker installs). In the topology,
>>>>> we use the JDBCAdapter to connect to MySQL and query for a given IP.
>>>>> Additionally, it's a single point of failure for that particular
>>>>> enrichment right now. If MySQL is down, geo enrichment can't occur.
>>>>>
>>>>> I'm proposing that we eliminate the use of MySQL entirely, through all
>>>>> installation paths (which, unless I missed some, includes Ansible, the
>>>>> Ambari Management Pack, and Docker). We'd do this by dropping all the
>>>>> various MySQL setup and management through the code, along with all
>>>>> the DDL, etc. The JDBCAdapter would stay, so that anybody who wants to
>>>>> set up their own databases for enrichments and install connectors is
>>>>> able to do so.
>>>>>
>>>>> In its place, I've looked at using MapDB, which is a really
>>>>> easy-to-use library for creating Java collections backed by a file
>>>>> (This is NOT a separate installation of anything, it's just a jar that
>>>>> manages interaction with the file system). Given the slow churn of the
>>>>> GeoIP files (I believe they get updated once a week), we can have a
>>>>> script that can be run when needed, downloads the MaxMind tar file,
>>>>> builds the MapDB file that will be used by the bolts, and places it
>>>>> into HDFS. Finally, we update a config to point to the new file, the
>>>>> bolts get the updated config callback and can update their db files.
>>>>> Inside the code, we wrap the MapDB portions to make it transparent to
>>>>> downstream code.
>>>>>
>>>>> The particularly nice parts about using MapDB are its ease of use and
>>>>> the fact that it offers, out of the box, the utilities we need to
>>>>> support the operations required here (keep in mind the GeoIP files use
>>>>> IP ranges, and we need to be able to easily grab the appropriate
>>>>> range).
>>>>>
>>>>> The main point of concern I have about this is that when we grab the
>>>>> HDFS file during an update, given that multiple JVMs can be running,
>>>>> we don't want them to clobber each other. I believe this can be
>>>>> avoided by simply using each worker's working directory to store the
>>>>> file (and appropriately ensuring threads on the same JVM manage
>>>>> multithreading). This should keep the JVMs (and the underlying DB
>>>>> files) entirely independent.
>>>>>
>>>>> This script would get called by the various installations during
>>>>> startup to do the initial setup. After install, it can then be called
>>>>> on demand.
>>>>>
>>>>> At this point, we should be all set, with everything running and
>>>>> updatable.
>>>>>
>>>>> Justin
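To make the per-worker update in Justin's proposal concrete, here is a minimal sketch of the copy-then-swap step, assuming the HDFS fetch has already landed the new file where the worker can read it (class and method names are illustrative, not Metron's actual code):

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.concurrent.atomic.AtomicReference;

// Sketch of the per-worker refresh flow: on the config-update callback, each
// worker copies the new db file into its own directory and atomically swaps
// the handle readers use, so separate JVMs never clobber each other's files.
public class GeoDbRefresher {
    private final Path workDir;  // this worker's private directory
    private final AtomicReference<Path> currentDb = new AtomicReference<>();

    public GeoDbRefresher(Path workDir) {
        this.workDir = workDir;
    }

    // Invoked when the config is updated to point at a new db file.
    public void refresh(Path newDbFile) throws IOException {
        // Unique name per refresh, so readers of the old file are undisturbed.
        Path local = workDir.resolve("geo-" + System.nanoTime() + ".db");
        Files.copy(newDbFile, local, StandardCopyOption.REPLACE_EXISTING);
        Path old = currentDb.getAndSet(local); // readers see old or new, never partial
        if (old != null) {
            Files.deleteIfExists(old);
        }
    }

    public Path current() {
        return currentDb.get();
    }

    public static void main(String[] args) throws IOException {
        // A temp file stands in for the file fetched from HDFS.
        Path remote = Files.createTempFile("maxmind", ".db");
        Files.write(remote, new byte[]{1, 2, 3});
        // A temp directory stands in for the worker's working directory.
        GeoDbRefresher refresher =
            new GeoDbRefresher(Files.createTempDirectory("worker"));
        refresher.refresh(remote);
        System.out.println(Files.size(refresher.current())); // prints 3
    }
}
```

The atomic-reference swap also covers the in-JVM multithreading point: bolt threads only ever see a fully written file, old or new.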