I think we can use the ES templates as a starting point, since Avro (HDFS)
and ES should be in sync.  There is already some normalization in place for
field names (ip_src_addr, etc.) so that enrichment can work across any log
type.  This is currently handled in the parsers.
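
For reference, a hand-made example (values invented; the point is just the
shared field names) of what a normalized message looks like after parsing:

    {
      "source.type": "bro",
      "timestamp": 1483733160000,
      "ip_src_addr": "10.0.2.15",
      "ip_src_port": 49211,
      "ip_dst_addr": "93.184.216.34",
      "ip_dst_port": 443,
      "original_string": "..."
    }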

I think what you're talking about here is simply expanding that to handle
more fields, right?  I'm all for that - the more normalized the data is, the
more fun I can have with it :)

Jon

On Fri, Jan 6, 2017, 4:26 PM Kyle Richardson <kylerichards...@gmail.com>
wrote:

> You're right. I don't think it needs to account for everything. As I
> understand it, one of the big selling features of Avro is schema evolution.
>
> Do you think we can take the ES templates as a starting point for
> developing an Avro schema? I do still think we need some type of
> normalization across the sensors for fields like URL, user name, and
> disposition. This wouldn't be specific to Avro but would allow us to better
> search across multiple sensor types in the UI too. Say, for example, if I
> have two different proxy solutions.
>
> -Kyle
>
> On Fri, Jan 6, 2017 at 2:28 PM, zeo...@gmail.com <zeo...@gmail.com> wrote:
>
> > Does it really need to account for all enrichments right off the bat?  I'm
> > not familiar with these options in practice, but my research led me to
> > believe that adding fields to an Avro schema is not a huge issue; changing
> > or removing them is the true problem.  I have no proof to substantiate that
> > claim, however; I've just heard that question asked and seen people
> > familiar with Avro reply uniformly in that way.
> >
> > My thought, based on that assumption, is that we simply need to handle the
> > out-of-the-box enrichments and document the required schema change in our
> > guides to creating custom enrichments.
> >
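> > To make that concrete, a hand-written sketch (not from any real Metron
> > schema; field names are just examples) of the safe kind of change, adding
> > a field with a default:
> >
> >     {
> >       "type": "record",
> >       "name": "EnrichedEvent",
> >       "fields": [
> >         {"name": "ip_src_addr", "type": "string"},
> >         {"name": "timestamp", "type": "long"},
> >         {"name": "geo_city", "type": ["null", "string"], "default": null}
> >       ]
> >     }
> >
> > Here geo_city stands in for a newly added enrichment field: readers on the
> > old schema ignore it, and readers on the new schema fill in the default
> > when they hit old data.  Renaming or dropping ip_src_addr would be the
> > breaking kind of change.
> >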
> > In ES we are currently doing one template per sensor, which gives us the
> > flexibility of overlapping field names (per sensor).
> >
> > Jon
> >
> > On Fri, Jan 6, 2017, 12:33 PM Kyle Richardson <kylerichards...@gmail.com>
> > wrote:
> >
> > > Thanks, Jon. Really interesting talk.
> > >
> > > For the GitHub data set discussed (which probably most closely mimics
> > > Metron data due to number of fields and overall diversity), Avro with
> > > Snappy compression seemed like the best balance of storage size and
> > > retrieval time. I did find it interesting that he said Parquet was
> > > originally developed for log data sets but didn't perform as well on the
> > > GitHub data.
> > >
> > > I think our challenge is going to be on the schema. Would we create a
> > > schema per sensor type and try to account for all of the possible
> > > enrichments? The problem there is that similar data may not be mapped to
> > > the same field names across sensors. We may need to think about expanding
> > > our base JSON schema beyond these 7 fields
> > > (https://cwiki.apache.org/confluence/display/METRON/Metron+JSON+Object)
> > > to account for normalizing things like URL, user name, and disposition
> > > (e.g. whether an action was allowed or denied).
> > >
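> > > To make that concrete, I'm picturing something along these lines (field
> > > names are strawmen, not a proposal):
> > >
> > >     {
> > >       "ip_src_addr": "10.1.2.3",
> > >       "timestamp": 1483722360000,
> > >       "url": "http://example.com/index.html",
> > >       "username": "jdoe",
> > >       "disposition": "denied"
> > >     }
> > >
> > > That way a search on disposition:denied would hit both proxy solutions
> > > at once.
> > >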
> > > Thoughts?
> > >
> > > -Kyle
> > >
> > > On Tue, Jan 3, 2017 at 11:30 AM, zeo...@gmail.com <zeo...@gmail.com>
> > > wrote:
> > >
> > > > For those interested, I ended up finding a recording of the talk itself
> > > > when doing some Avro research -
> > > > https://www.youtube.com/watch?v=tB28rPTvRiI
> > > >
> > > > Jon
> > > >
> > > > On Sun, Jan 1, 2017 at 8:41 PM Matt Foley <ma...@apache.org> wrote:
> > > >
> > > > > I’m not an expert on these things, but my understanding is that Avro
> > > > > and ORC serve many of the same needs.  The biggest difference is that
> > > > > ORC is columnar, and Avro isn’t.  Avro, ORC, and Parquet were compared
> > > > > in detail at last year’s Hadoop Summit; the slideshare prezo is here:
> > > > > http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet
> > > > >
> > > > > Its conclusion: “For complex tables with common strings, Avro with
> > > > > Snappy is a good fit.  For other tables [or when applications “just
> > > > > need a few columns” of the tables], ORC with Zlib is a good fit.”
> > > > > (The addition in square brackets incorporates a quote from another
> > > > > part of the prezo.)  But do look at the prezo please, it gives
> > > > > detailed benchmarks showing when each one is better.
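> > > > >
> > > > > In Hive terms, my rough sketch of the two choices (untested, and the
> > > > > events table is a stand-in; please verify the settings against the
> > > > > docs) would be:
> > > > >
> > > > >     -- row-oriented; good for whole-record scans
> > > > >     SET hive.exec.compress.output=true;
> > > > >     SET avro.output.codec=snappy;
> > > > >     CREATE TABLE events_avro STORED AS AVRO AS SELECT * FROM events;
> > > > >
> > > > >     -- columnar; good when queries touch only a few columns
> > > > >     CREATE TABLE events_orc STORED AS ORC
> > > > >       TBLPROPERTIES ('orc.compress'='ZLIB')
> > > > >       AS SELECT * FROM events;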
> > > > >
> > > > > --Matt
> > > > >
> > > > > On 1/1/17, 5:18 AM, "zeo...@gmail.com" <zeo...@gmail.com> wrote:
> > > > >
> > > > >     I don't recall a conversation on that product specifically, but
> > > > >     I've definitely brought up the need to search HDFS from time to
> > > > >     time.  Things like Spark SQL, Hive, and Oozie have been discussed,
> > > > >     but Avro is new to me; I'll have to look into it.  Are you able to
> > > > >     summarize its benefits?
> > > > >
> > > > >     Jon
> > > > >
> > > > >     On Wed, Dec 28, 2016, 14:45 Kyle Richardson <kylerichards...@gmail.com>
> > > > >     wrote:
> > > > >
> > > > >     > This thread got me thinking... there are likely a fair number
> > > > >     > of use cases for searching and analyzing the output stored in
> > > > >     > HDFS. Dima's use case is certainly one. Has there been any
> > > > >     > discussion on the use of Avro to store the output in HDFS? This
> > > > >     > would likely require an expansion of the current JSON schema.
> > > > >     >
> > > > >     > -Kyle
> > > > >     >
> > > > >     > On Thu, Dec 22, 2016 at 5:53 PM, Casey Stella <ceste...@gmail.com>
> > > > >     > wrote:
> > > > >     >
> > > > >     > > Oozie (or something like it) would appear to me to be the
> > > > >     > > correct tool here.  You are likely moving files around and
> > > > >     > > pinning up Hive tables:
> > > > >     > >
> > > > >     > >    - Moving the data written in HDFS from
> > > > >     > >    /apps/metron/enrichment/${sensor} to another directory in
> > > > >     > >    HDFS
> > > > >     > >    - Running a job in Hive or Pig or Spark to take the JSON
> > > > >     > >    blobs, map them to rows, and pin it up as an ORC table for
> > > > >     > >    downstream analytics (a sketch follows below)
> > > > >     > >
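> > > > >     > > For the second bullet, a minimal sketch (untested; the table
> > > > >     > > names and path are invented, and it assumes a JSON SerDe such
> > > > >     > > as the hive-hcatalog one is available):
> > > > >     > >
> > > > >     > >    -- external table over the raw JSON blobs
> > > > >     > >    CREATE EXTERNAL TABLE bro_json (
> > > > >     > >      ip_src_addr STRING,
> > > > >     > >      ip_dst_addr STRING,
> > > > >     > >      `timestamp` BIGINT
> > > > >     > >    )
> > > > >     > >    ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
> > > > >     > >    LOCATION '/apps/metron/enrichment_archive/bro';
> > > > >     > >
> > > > >     > >    -- pin it up as an ORC table for downstream analytics
> > > > >     > >    CREATE TABLE bro_orc STORED AS ORC
> > > > >     > >      AS SELECT * FROM bro_json;
> > > > >     > >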
> > > > >     > > NiFi is mostly about getting data into the cluster, not
> > > > >     > > really for scheduling large-scale batch ETL, I think.
> > > > >     > >
> > > > >     > > Casey
> > > > >     > >
> > > > >     > > On Thu, Dec 22, 2016 at 5:18 PM, Dima Kovalyov <dima.koval...@sstech.us>
> > > > >     > > wrote:
> > > > >     > >
> > > > >     > > > Thank you for the reply, Carolyn.
> > > > >     > > >
> > > > >     > > > Currently, for test purposes, we enrich flow data with Geo
> > > > >     > > > and ThreatIntel malware IP, but plan to expand this further.
> > > > >     > > >
> > > > >     > > > Our dev team is working on an Oozie job to process this, so
> > > > >     > > > meanwhile I wonder if I could use NiFi for this purpose
> > > > >     > > > (because we are already using it for data ingest and
> > > > >     > > > streaming).
> > > > >     > > >
> > > > >     > > > Could you elaborate on why it may be overkill? The idea is
> > > > >     > > > to have everything in one place instead of hacking into
> > > > >     > > > Metron libraries and code.
> > > > >     > > >
> > > > >     > > > - Dima
> > > > >     > > >
> > > > >     > > > On 12/22/2016 02:26 AM, Carolyn Duby wrote:
> > > > >     > > > > Hi Dima -
> > > > >     > > > >
> > > > >     > > > > What type of analytics are you looking to do?  Is the
> > > > >     > > > > normalized format not working?  You could use an Oozie or
> > > > >     > > > > Spark job to create derivative tables.
> > > > >     > > > >
> > > > >     > > > > NiFi may be overkill for breaking up the Kafka stream.
> > > > >     > > > > Spark Streaming may be easier.
> > > > >     > > > >
> > > > >     > > > > Thanks
> > > > >     > > > > Carolyn
> > > > >     > > > >
> > > > >     > > > >
> > > > >     > > > >
> > > > >     > > > > Sent from my Verizon, Samsung Galaxy smartphone
> > > > >     > > > >
> > > > >     > > > >
> > > > >     > > > > -------- Original message --------
> > > > >     > > > > From: Dima Kovalyov <dima.koval...@sstech.us>
> > > > >     > > > > Date: 12/21/16 6:28 PM (GMT-05:00)
> > > > >     > > > > To: dev@metron.incubator.apache.org
> > > > >     > > > > Subject: Long-term storage for enriched data
> > > > >     > > > >
> > > > >     > > > > Hello,
> > > > >     > > > >
> > > > >     > > > > Currently we are researching a fast and resource-efficient
> > > > >     > > > > way to save enriched data in Hive for further analytics.
> > > > >     > > > >
> > > > >     > > > > There are two scenarios we are considering:
> > > > >     > > > > a) Use an Oozie Java job that uses Metron enrichment
> > > > >     > > > > classes to "manually" enrich each line of the source data
> > > > >     > > > > picked up from the source dir (the one we have already
> > > > >     > > > > developed and are using). That is something we developed
> > > > >     > > > > on our own. Downside: custom code built on top of the
> > > > >     > > > > Metron source code.
> > > > >     > > > >
> > > > >     > > > > b) Use NiFi to listen to the indexing Kafka topic -> split
> > > > >     > > > > the stream by source type -> put every source type into
> > > > >     > > > > the corresponding Hive table.
> > > > >     > > > >
> > > > >     > > > > I wonder if someone has gone down either of these paths
> > > > >     > > > > and whether there are best practices for this? Please
> > > > >     > > > > advise. Thank you.
> > > > >     > > > >
> > > > >     > > > > - Dima
> > > > >     > > > >
> > > > >     > > > >
> > > > >     > > >
> > > > >     > > >
> > > > >     > >
> > > > >     >
> > > > >     --
> > > > >
> > > > >     Jon
> > > > >
> > > > >     Sent from my mobile device
> > > > >
> > > > >
> > > > >
> > > > >
> > > > > --
> > > >
> > > > Jon
> > > >
> > > > Sent from my mobile device
> > > >
> > >
> > --
> >
> > Jon
> >
> > Sent from my mobile device
> >
>
-- 

Jon

Sent from my mobile device
