I think we can use the ES templates to start off, as Avro (HDFS) and ES should be in sync. There is already some normalization in place for field names (ip_src_addr, etc.) so that enrichment can work across any log type. This is currently handled in the parsers.

I think what you're talking about here is simply expanding that to handle more fields, right? I'm all for that - the more normalized the data is, the more fun I can have with it :)

Jon
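As a minimal sketch of the normalization described above (this is not Metron's actual parser code; the sensor names and rename maps are invented for illustration, and the real mappings live in the parser configurations):

# Illustrative only: a sketch of per-sensor field-name normalization.
# Hypothetical rename maps; in Metron the real mappings live in the parsers.
FIELD_MAPS = {
    "squid": {"client_ip": "ip_src_addr", "server_ip": "ip_dst_addr"},
    "bro":   {"id.orig_h": "ip_src_addr", "id.resp_h": "ip_dst_addr"},
}

def normalize(sensor_type, message):
    """Return a copy of a parsed message with sensor-specific keys renamed
    to the common field names so enrichment works across any log type."""
    rename = FIELD_MAPS.get(sensor_type, {})
    return {rename.get(key, key): value for key, value in message.items()}

print(normalize("squid", {"client_ip": "10.0.0.5", "action": "TCP_MISS"}))
# {'ip_src_addr': '10.0.0.5', 'action': 'TCP_MISS'}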
On Fri, Jan 6, 2017, 4:26 PM Kyle Richardson <kylerichards...@gmail.com> wrote:

You're right. I don't think it needs to account for everything. As I understand it, one of the big selling features of Avro is schema evolution.

Do you think we can take the ES templates as a starting point for developing an Avro schema? I do still think we need some type of normalization across the sensors for fields like URL, user name, and disposition. This wouldn't be specific to Avro, but it would allow us to better search across multiple sensor types in the UI too. Say, for example, if I have two different proxy solutions.

-Kyle

On Fri, Jan 6, 2017 at 2:28 PM, zeo...@gmail.com <zeo...@gmail.com> wrote:

Does it really need to account for all enrichments off the bat? I'm not familiar with these options in practice, but my research led me to believe that adding fields to an Avro schema is not a huge issue; changing or removing them is the true problem. I have no proof to substantiate my claim, however - just that I heard that question asked and read people familiar with Avro reply uniformly in that way.

My thought, based on that assumption, is that we simply need to handle the out-of-the-box enrichments and document the required schema change in our guides to creating custom enrichments.

In ES we are currently doing one template per sensor, which gives us that overlapping field name (per sensor) flexibility.

Jon

On Fri, Jan 6, 2017, 12:33 PM Kyle Richardson <kylerichards...@gmail.com> wrote:

Thanks, Jon. Really interesting talk.

For the GitHub data set discussed (which probably most closely mimics Metron data due to the number of fields and overall diversity), Avro with Snappy compression seemed like the best balance of storage size and retrieval time. I did find it interesting that he said Parquet was originally developed for log data sets but didn't perform as well on the GitHub data.

I think our challenge is going to be the schema. Would we create a schema per sensor type and try to account for all of the possible enrichments? The problem there is that similar data may not be mapped to the same field names across sensors. We may need to think about expanding our base JSON schema beyond these 7 fields (https://cwiki.apache.org/confluence/display/METRON/Metron+JSON+Object) to account for normalizing things like URL, user name, and disposition (e.g. whether an action was allowed or denied).

Thoughts?

-Kyle

On Tue, Jan 3, 2017 at 11:30 AM, zeo...@gmail.com <zeo...@gmail.com> wrote:

For those interested, I ended up finding a recording of the talk itself when doing some Avro research - https://www.youtube.com/watch?v=tB28rPTvRiI

Jon

On Sun, Jan 1, 2017 at 8:41 PM Matt Foley <ma...@apache.org> wrote:

I'm not an expert on these things, but my understanding is that Avro and ORC serve many of the same needs. The biggest difference is that ORC is columnar and Avro isn't. Avro, ORC, and Parquet were compared in detail at last year's Hadoop Summit; the slideshare prezo is here:
http://www.slideshare.net/HadoopSummit/file-format-benchmark-avro-json-orc-parquet

Its conclusion: "For complex tables with common strings, Avro with Snappy is a good fit. For other tables [or when applications "just need a few columns" of the tables], ORC with Zlib is a good fit." (The addition in square brackets incorporates a quote from another part of the prezo.) But do look at the prezo, please - it gives detailed benchmarks showing when each one is better.

--Matt
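To make the Avro side of this concrete, here is a minimal sketch assuming the third-party fastavro library (plus python-snappy for the codec). The schema is invented for illustration rather than being an official Metron schema; note that Avro field names cannot contain dots, so a field like source.type would have to be mapped to something like source_type, which ties back into the normalization discussion above.

from fastavro import parse_schema, writer

# Invented example schema covering a few base fields plus one enrichment field.
schema = parse_schema({
    "type": "record",
    "name": "MetronMessage",
    "fields": [
        {"name": "source_type",     "type": "string"},
        {"name": "original_string", "type": "string"},
        {"name": "timestamp",       "type": "long"},
        {"name": "ip_src_addr",     "type": ["null", "string"], "default": None},
        {"name": "ip_dst_addr",     "type": ["null", "string"], "default": None},
        # Schema evolution in practice: new enrichment fields get *added* with a
        # default so old readers keep working; renaming or removing fields is
        # what breaks compatibility.
        {"name": "threat_triage_score", "type": ["null", "double"], "default": None},
    ],
})

records = [{
    "source_type": "squid",
    "original_string": "1482421200.123 10.0.0.5 TCP_MISS ...",
    "timestamp": 1482421200123,
    "ip_src_addr": "10.0.0.5",
    "ip_dst_addr": None,
    "threat_triage_score": None,
}]

# Write an Avro container file with Snappy compression.
with open("enriched.avro", "wb") as out:
    writer(out, schema, records, codec="snappy")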
On 1/1/17, 5:18 AM, "zeo...@gmail.com" <zeo...@gmail.com> wrote:

I don't recall a conversation on that product specifically, but I've definitely brought up the need to search HDFS from time to time. Things like Spark SQL, Hive, and Oozie have been discussed, but Avro is new to me; I'll have to look into it. Are you able to summarize its benefits?

Jon

On Wed, Dec 28, 2016, 14:45 Kyle Richardson <kylerichards...@gmail.com> wrote:

This thread got me thinking... there are likely a fair number of use cases for searching and analyzing the output stored in HDFS. Dima's use case is certainly one. Has there been any discussion on the use of Avro to store the output in HDFS? This would likely require an expansion of the current JSON schema.

-Kyle

On Thu, Dec 22, 2016 at 5:53 PM, Casey Stella <ceste...@gmail.com> wrote:

Oozie (or something like it) would appear to me to be the correct tool here. You are likely moving files around and pinning up Hive tables:

- Moving the data written in HDFS from /apps/metron/enrichment/${sensor} to another directory in HDFS
- Running a job in Hive or Pig or Spark to take the JSON blobs, map them to rows, and pin it up as an ORC table for downstream analytics

NiFi is mostly about getting data into the cluster, not really for scheduling large-scale batch ETL, I think.

Casey
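A rough PySpark sketch of the second bullet in Casey's suggestion - reading one sensor's JSON blobs out of HDFS and pinning them up as an ORC table. The job name, output table name, and exact HDFS layout are assumptions for illustration; a job like this is what would then be scheduled by Oozie.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("metron-json-to-orc")   # hypothetical job name
         .enableHiveSupport()             # required to create Hive tables
         .getOrCreate())

sensor = "squid"                                      # example sensor
src = "/apps/metron/enrichment/{}".format(sensor)     # path from Casey's note

# Spark infers the schema from the JSON blobs; every field becomes a column.
df = spark.read.json(src)

# Pin the data up as an ORC-backed Hive table for downstream analytics.
(df.write
   .mode("append")
   .format("orc")
   .saveAsTable("metron_enriched_{}".format(sensor)))  # hypothetical table name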
On Thu, Dec 22, 2016 at 5:18 PM, Dima Kovalyov <dima.koval...@sstech.us> wrote:

Thank you for the reply, Carolyn.

Currently, for test purposes, we enrich flow data with Geo and ThreatIntel malware IP, but we plan to expand this further.

Our dev team is working on an Oozie job to process this, so in the meantime I wonder if I could use NiFi for this purpose (because we are already using it for data ingest and streaming).

Could you elaborate on why it may be overkill? The idea is to have everything in one place instead of hacking into Metron libraries and code.

- Dima

On 12/22/2016 02:26 AM, Carolyn Duby wrote:

Hi Dima -

What type of analytics are you looking to do? Is the normalized format not working? You could use an Oozie or Spark job to create derivative tables.

NiFi may be overkill for breaking up the Kafka stream. Spark Streaming may be easier.

Thanks
Carolyn

-------- Original message --------
From: Dima Kovalyov <dima.koval...@sstech.us>
Date: 12/21/16 6:28 PM (GMT-05:00)
To: dev@metron.incubator.apache.org
Subject: Long-term storage for enriched data

Hello,

We are currently researching a fast and resource-efficient way to save enriched data in Hive for further analytics.

There are two scenarios that we are considering:

a) Use an Oozie Java job that uses the Metron enrichment classes to "manually" enrich each line of the source data picked up from the source dir (the one we have already developed and are using). This is something we developed on our own. Downside: custom code built on top of the Metron source code.

b) Use NiFi to listen to the indexing Kafka topic -> split the stream by source type -> put every source type into a corresponding Hive table.

I wonder if anyone has gone in either of these directions and whether there are best practices for this? Please advise.
Thank you.

- Dima
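To close the loop on scenario (b) and Carolyn's point that Spark streaming may be easier than NiFi here: a rough PySpark Structured Streaming sketch that reads the indexing Kafka topic and splits the enriched messages into per-sensor HDFS directories. The broker address, topic name, output paths, and the assumption that the sensor name can be read from the message's source.type field are all illustrative, and the job needs the spark-sql-kafka connector package on its classpath.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType

spark = (SparkSession.builder
         .appName("metron-indexing-to-hdfs")   # hypothetical job name
         .getOrCreate())

# Read the enriched JSON messages from the indexing topic (broker and topic
# name are assumptions; adjust to the environment).
raw = (spark.readStream
       .format("kafka")
       .option("kafka.bootstrap.servers", "kafka-broker:6667")
       .option("subscribe", "indexing")
       .load())

# Only the sensor name is needed for routing; the full message rides along
# in the "json" column.
sensor_schema = StructType([StructField("source.type", StringType())])
messages = (raw.selectExpr("CAST(value AS STRING) AS json")
            .withColumn("sensor",
                        from_json(col("json"), sensor_schema).getField("source.type")))

# One HDFS directory per sensor; Hive external tables (or a later batch job)
# can expose each directory as its own table.
query = (messages.select("sensor", "json").writeStream
         .format("json")
         .option("path", "/apps/metron/long_term")          # hypothetical path
         .option("checkpointLocation", "/apps/metron/long_term_ckpt")
         .partitionBy("sensor")
         .start())

query.awaitTermination()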