Hi Arun,

This looks really good and fills some obvious gaps in the security landscape.
Happy to contribute any way you want.

All the best!!!
Bosco

On 10/20/15, 8:02 AM, "Alex Karasulu" <akaras...@gmail.com on behalf of akaras...@apache.org> wrote:

>Hi Arun,
>
>Eagle sounds very promising. I just had a discussion with someone about this exact need. I do however agree with Greg on the name. As far as I can see, besides the name, your weakest point is the all-eBay-employed team. It's not a blocker and can be fixed during incubation. Good luck to you.
>
>Alex
>
>
>On Tue, Oct 20, 2015 at 5:51 PM, Manoharan, Arun <armanoha...@ebay.com> wrote:
>
>> Hi Greg,
>>
>> Thank you for reviewing the proposal.
>>
>> Originally we thought Eagle might be trademarked by someone already, but I went through the eBay legal team to get clearance for the name to be used. We will look into it again to see if there will be potential problems.
>>
>> Thanks,
>> Arun
>>
>> On 10/20/15, 1:52 AM, "Greg Stein" <gst...@gmail.com> wrote:
>>
>> >Hey there, Arun! ... I have no commentary on the proposal itself, as it looks like a great proposal. I would suggest being a bit wary of the name, as "Eagle" is a *very* popular PCB design program.
>> >
>> >On Mon, Oct 19, 2015 at 10:33 AM, Manoharan, Arun <armanoha...@ebay.com> wrote:
>> >
>> >> Hello Everyone,
>> >>
>> >> My name is Arun Manoharan. I am currently a product manager in the Analytics platform team at eBay Inc.
>> >>
>> >> I would like to start a discussion on Eagle and its joining the ASF as an incubation project.
>> >>
>> >> Eagle is a monitoring solution for Hadoop that instantly identifies access to sensitive data, recognizes attacks and malicious activities, and takes action in real time. Eagle supports a wide variety of policies on HDFS data and Hive. Eagle also provides machine learning models for detecting anomalous user behavior in Hadoop.
>> >>
>> >> The proposal is available on the wiki here:
>> >> https://wiki.apache.org/incubator/EagleProposal
>> >>
>> >> The text of the proposal is also available at the end of this email.
>> >>
>> >> Thanks for your time and help.
>> >>
>> >> Thanks,
>> >> Arun
>> >>
>> >> <COPY of the proposal in text format>
>> >>
>> >> Eagle
>> >>
>> >> Abstract
>> >> Eagle is an open source monitoring solution for Hadoop that instantly identifies access to sensitive data, recognizes attacks and malicious activities in Hadoop, and takes action.
>> >>
>> >> Proposal
>> >> Eagle audits access to HDFS files and to Hive and HBase tables in real time, enforces policies defined on sensitive data access, and alerts on or blocks a user's access to that sensitive data in real time. Eagle also creates user profiles based on typical access behaviour for HDFS and Hive and sends alerts when anomalous behaviour is detected. Eagle can also import sensitive data information classified by external classification engines to help define its policies.
>> >>
>> >> Overview of Eagle
>> >> Eagle has 3 main parts.
>> >> 1. Data collection and storage - Eagle collects data from various Hadoop logs in real time using the Kafka/YARN API and uses HDFS and HBase for storage.
>> >> 2. Data processing and policy engine - Eagle allows users to create policies based on various metadata properties of HDFS, Hive and HBase data.
>> >> 3. Eagle services - Eagle services include the policy manager, the query service and the visualization component. Eagle provides an intuitive user interface to administer Eagle and an alert dashboard to respond to real time alerts.
>> >>
>> >> Data Collection and Storage:
>> >> Eagle provides a programming API for extending Eagle to integrate any data source into the Eagle policy evaluation framework.
>> >> For example, Eagle HDFS audit monitoring collects data from Kafka, which is populated from the NameNode log4j appender or from a Logstash agent. Eagle Hive monitoring collects Hive query logs from running jobs through the YARN API, which is designed to be scalable and fault-tolerant. Eagle uses HBase as storage for metadata and metrics data, and also supports relational databases through a configuration change.
>> >>
>> >> Data Processing and Policy Engine:
>> >> Processing Engine: Eagle provides a stream processing API which is an abstraction over Apache Storm. It can also be extended to other streaming engines. This abstraction allows developers to assemble data transformation, filtering, external data joins etc. without being physically bound to a specific streaming platform. The Eagle streaming API allows developers to easily integrate business logic with the Eagle policy engine, and internally the Eagle framework compiles the business logic execution DAG into program primitives of the underlying stream infrastructure, e.g. Apache Storm. For example, Eagle HDFS monitoring transforms audit logs from the NameNode into objects and joins sensitivity metadata and security zone metadata, which are generated by external programs or configured by the user. Eagle Hive monitoring filters running jobs to get the Hive query string, parses the query string into an object, and then joins sensitivity metadata.
>> >> Alerting Framework: The Eagle Alert Framework includes a stream metadata API, a scalable policy engine framework and an extensible policy engine framework. The stream metadata API allows developers to declare an event schema, including what attributes constitute an event, what the type of each attribute is, and how to dynamically resolve attribute values at runtime when a user configures a policy.
>> >> The scalable policy engine framework allows policies to be executed on different physical nodes in parallel. It also lets you define your own policy partitioner class. The policy engine framework, together with the stream partitioning capability provided by all streaming platforms, makes sure policies and events can be evaluated in a fully distributed way. The extensible policy engine framework allows developers to plug in a new policy engine with a few lines of code. The WSO2 Siddhi CEP engine is the policy engine which Eagle supports as a first-class citizen.
>> >> Machine Learning module: Eagle provides capabilities to define user activity patterns or user profiles for Hadoop users based on user behaviour in the platform. These user profiles are modeled using machine learning algorithms and used for detection of anomalous user activities. Eagle uses eigenvalue decomposition and density estimation algorithms for generating user profile models. The module reads data from HDFS audit logs, preprocesses and aggregates the data, and generates models using Spark programming APIs. Once models are generated, Eagle uses the stream processing engine for near real-time anomaly detection to determine whether any user's activities are suspicious.
>> >>
>> >> Eagle Services:
>> >> Query Service: Eagle provides a SQL-like service API to support comprehensive computation over huge sets of data on the fly, e.g. filtering, aggregation, histogram, sorting, top, arithmetic expressions, pagination etc. HBase is the data storage which Eagle supports as a first-class citizen; relational databases are supported as well. For HBase storage, the Eagle query framework compiles the user-provided SQL-like query into HBase native filter objects and executes it through an HBase coprocessor on the fly.
>> >> Policy Manager: The Eagle policy manager provides a UI and a RESTful API for users to define policies with just a few clicks. It includes a site management UI, policy editor, sensitivity metadata import, HDFS or Hive sensitive resource browsing, alert dashboards etc.
>> >>
>> >> Background
>> >> Data is one of the most important assets for today's businesses, which makes data security one of the top priorities of today's enterprises. Hadoop is widely used across different verticals as a big data repository to store this data in most modern enterprises.
>> >> At eBay we use the Hadoop platform extensively for our data processing needs. Our data in Hadoop is becoming bigger and bigger as our user base is seeing exponential growth. Today there is a variety of data sets available in our Hadoop clusters for our users to consume. eBay has around 120 PB of data stored in HDFS across 6 different clusters and 1800+ active Hadoop users consuming data through Hive, HBase and MapReduce jobs every day to build applications using this data. With this astronomical growth of data there are also challenges in securing sensitive data and monitoring access to this sensitive data. Today in large organizations HDFS is the de facto standard for storing big data. Data sets which include, but are not limited to, consumer sentiment, social media data, customer segmentation, web clicks, sensor data, geo-location and transaction data get stored in Hadoop for day-to-day business needs.
>> >> We at eBay want to make sure the sensitive data and data platforms are completely protected from security breaches.
>> >> So we partnered very closely with our Information Security team to understand the requirements for Eagle to monitor sensitive data access on Hadoop:
>> >> 1. Ability to identify and stop security threats in real time
>> >> 2. Scale for big data (support PB scale and billions of events)
>> >> 3. Ability to create data access policies
>> >> 4. Support multiple data sources like HDFS, HBase, Hive
>> >> 5. Visualize alerts in real time
>> >> 6. Ability to block malicious access in real time
>> >> We did not find any data access monitoring solution available today that can provide the features and functionality that we need to monitor data access in the Hadoop ecosystem at our scale. Hence with an excellent team of world class developers and several users, we have been able to bring Eagle into production as well as open source it.
>> >>
>> >> Rationale
>> >> In today's world, data is an important asset for any company. Businesses are using data extensively to create amazing experiences for users. Data has to be protected and access to data should be secured from security breaches. Today Hadoop is not only used to store logs but also stores financial data, sensitive data sets, geographical data, user click stream data sets etc., which makes it more important to protect it from security breaches. To secure a data platform there are multiple things that need to happen. One is having a strong access control mechanism, which today is provided by Apache Ranger and Apache Sentry. These tools provide fine-grained access control for data sets on Hadoop. But there is a big gap in terms of monitoring all the data access events and activities in order to secure the Hadoop data platform.
>> >> With strong access control, perimeter security and data access monitoring in place together, data in the Hadoop clusters can be secured against breaches. We looked around and found the following:
>> >> Existing data activity monitoring products are designed for traditional databases and data warehouses. Existing monitoring platforms cannot scale out to support fast growing data and petabyte scale. A few products in the industry are still at a very early stage of supporting HDFS, Hive and HBase data access monitoring.
>> >> As mentioned in the background, the business requirement and urgency to secure the data from users with malicious intent drove eBay to invest in building a real time data access monitoring solution from scratch to offer real time alerts and remediation features for malicious data access.
>> >> With the power of open source distributed systems like Hadoop, Kafka and many more, we were able to develop a data activity monitoring system that can scale, and can identify and stop malicious access in real time.
>> >> Eagle allows admins to create standard access policies and rules for monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box machine learning models for modeling user profiles based on user access behaviour and uses the models to alert on anomalies.
>> >>
>> >> Current Status
>> >>
>> >> Meritocracy
>> >> Eagle has been deployed in production at eBay for monitoring billions of events per day from HDFS and Hive operations. From the start, the product has been built with a focus on high scalability and application extensibility, and Eagle has demonstrated great performance in responding to suspicious events instantly and great flexibility in defining policy.
>> >>
>> >> Community
>> >> Eagle seeks to develop the developer and user communities during incubation.
>> >>
>> >> Core Developers
>> >> Eagle is currently being designed and developed by engineers from eBay Inc.: Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang, Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of these core developers have deep expertise in developing monitoring products for the Hadoop ecosystem.
>> >>
>> >> Alignment
>> >> The ASF is a natural host for Eagle given that it is already the home of Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data projects. Eagle leverages a lot of Apache open-source products. Eagle was designed to offer real time insights into sensitive data access by actively monitoring data access on various data sets in Hadoop, and to provide an extensible alerting framework with a powerful policy engine. Eagle complements the existing Hadoop platform area by providing a comprehensive monitoring and alerting solution for detecting sensitive data access threats based on preset policies and machine learning models for user behaviour analysis.
>> >>
>> >> Known Risks
>> >>
>> >> Orphaned Products
>> >> The core developers of the Eagle team work full time on this project. There is no risk of Eagle getting orphaned since eBay is extensively using it in its production Hadoop clusters and has plans to go beyond Hadoop. For example, currently there are 7 Hadoop clusters and 2 of them are being monitored using Hadoop Eagle in production. We have plans to extend it to all Hadoop clusters and eventually other data platforms. There are tens of policies onboarded and actively monitored, with plans to onboard more use cases. We are very confident that every Hadoop cluster in the world will be monitored using Eagle for securing the Hadoop ecosystem by actively monitoring for data access on sensitive data.
>> >> We plan to extend and diversify this community further through Apache. We presented Eagle at the Hadoop Summit in China and garnered interest from different companies who use Hadoop extensively.
>> >>
>> >> Inexperience with Open Source
>> >> The core developers are all active users and followers of open source. They are already committers and contributors to the Eagle GitHub project. All have been involved with the source code that has been released under an open source license, and several of them also have experience developing code in an open source environment. Though the core set of developers do not have Apache open source experience, there are plans to onboard individuals with Apache open source experience onto the project. Apache Kylin PMC members are also in the same eBay organization. We work very closely with Apache Ranger committers and look forward to finding meaningful integrations to improve the security of the Hadoop platform.
>> >>
>> >> Homogenous Developers
>> >> The core developers are from eBay. Today the problem of monitoring data activities to find and stop threats is a universal problem faced by all businesses. The Apache Incubation process encourages an open and diverse meritocratic community. Eagle intends to make every possible effort to build a diverse, vibrant and involved community and has already received substantial interest from various organizations.
>> >>
>> >> Reliance on Salaried Developers
>> >> eBay invested in Eagle as the monitoring solution for Hadoop clusters and some of its key engineers are working full time on the project. In addition, since there is a growing need to secure sensitive data access and for a data activity monitoring solution for Hadoop, we look forward to other Apache developers and researchers contributing to the project.
>> >> Additional contributors, including Apache committers, have plans to join this effort shortly. Also key to addressing the risk associated with relying on salaried developers from a single entity is to increase the diversity of the contributors and actively lobby for domain experts in the security space to contribute. Eagle intends to do this.
>> >>
>> >> Relationships with Other Apache Products
>> >> Eagle has a strong relationship with and dependency on Apache Hadoop, HBase, Spark, Kafka and Storm. Being part of Apache's Incubation community could help with closer collaboration among these projects as well as others.
>> >>
>> >> An Excessive Fascination with the Apache Brand
>> >> Eagle is proposing to enter incubation at Apache in order to help efforts to diversify the committer base, not so much to capitalize on the Apache brand. The Eagle project is in production use already inside eBay, but is not expected to be an eBay product for external customers. As such, the Eagle project is not seeking to use the Apache brand as a marketing tool.
>> >>
>> >> Documentation
>> >> Information about Eagle can be found at https://github.com/eBay/Eagle.
>> >> The following link provides more information about Eagle: http://goeagle.io.
>> >>
>> >> Initial Source
>> >> Eagle has been under development since 2014 by a team of engineers at eBay Inc. It is currently hosted on Github.com under the Apache License 2.0 at https://github.com/eBay/Eagle. Once in incubation we will be moving the code base to an Apache git repository.
>> >>
>> >> External Dependencies
>> >> Eagle has the following external dependencies.
>> >> Basic
>> >> * JDK 1.7+
>> >> * Scala 2.10.4
>> >> * Apache Maven
>> >> * JUnit
>> >> * Log4j
>> >> * Slf4j
>> >> * Apache Commons
>> >> * Apache Commons Math3
>> >> * Jackson
>> >> * Siddhi CEP engine
>> >>
>> >> Hadoop
>> >> * Apache Hadoop
>> >> * Apache HBase
>> >> * Apache Hive
>> >> * Apache Zookeeper
>> >> * Apache Curator
>> >>
>> >> Apache Spark
>> >> * Spark Core Library
>> >>
>> >> REST Service
>> >> * Jersey
>> >>
>> >> Query
>> >> * Antlr
>> >>
>> >> Stream processing
>> >> * Apache Storm
>> >> * Apache Kafka
>> >>
>> >> Web
>> >> * AngularJS
>> >> * jQuery
>> >> * Bootstrap V3
>> >> * Moment JS
>> >> * Admin LTE
>> >> * html5shiv
>> >> * respond
>> >> * Fastclick
>> >> * Date Range Picker
>> >> * Flot JS
>> >>
>> >> Cryptography
>> >> Eagle will eventually support encryption on the wire. This is not one of the initial goals, and we do not expect Eagle to be a controlled export item due to the use of encryption. Eagle supports but does not require the Kerberos authentication mechanism to access secured Hadoop services.
>> >>
>> >> Required Resources
>> >>
>> >> Mailing List
>> >> * eagle-private for private PMC discussions
>> >> * eagle-dev for developers
>> >> * eagle-commits for all commits
>> >> * eagle-users for all eagle users
>> >>
>> >> Subversion Directory
>> >> * Git is the preferred source control system.
>> >>
>> >> Issue Tracking
>> >> * JIRA Eagle (Eagle)
>> >>
>> >> Other Resources
>> >> The existing code already has unit tests so we will make use of existing Apache continuous testing infrastructure. The resulting load should not be very large.
>> >>
>> >> Initial Committers
>> >> * Seshu Adunuthula <sadunuthula at ebay dot com>
>> >> * Arun Manoharan <armanoharan at ebay dot com>
>> >> * Edward Zhang <yonzhang at ebay dot com>
>> >> * Hao Chen <hchen9 at ebay dot com>
>> >> * Chaitali Gupta <cgupta at ebay dot com>
>> >> * Libin Sun <libsun at ebay dot com>
>> >> * Jilin Jiang <jiljiang at ebay dot com>
>> >> * Qingwen Zhao <qingwzhao at ebay dot com>
>> >> * Hemanth Dendukuri <hdendukuri at ebay dot com>
>> >> * Senthil Kumar <senthilkumar at ebay dot com>
>> >> * Tan Chen <tanchen at ebay dot com>
>> >>
>> >> Affiliations
>> >> The initial committers are employees of eBay Inc.
>> >>
>> >> Sponsors
>> >>
>> >> Champion
>> >> * Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>> >>
>> >> Nominated Mentors
>> >> * Owen O'Malley <omalley at apache dot org> - Apache IPMC member, Hortonworks
>> >> * Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>> >> * Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member, Hortonworks
>> >>
>> >> Sponsoring Entity
>> >> We are requesting the Incubator to sponsor this project.
>> >>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
>> For additional commands, e-mail: general-h...@incubator.apache.org
>>
>
>
>--
>Best Regards,
>-- Alex