+1 (non-binding)

"Manoharan, Arun" <armanoha...@ebay.com> wrote:
>Hello Everyone,
>
>Thanks for all the feedback on the Eagle Proposal.
>
>I would like to call for a [VOTE] on Eagle joining the ASF as an incubation
>project.
>
>The vote is open for 72 hours:
>
>[ ] +1 accept Eagle in the Incubator
>[ ] ±0
>[ ] -1 (please give reason)
>
>Eagle is a Monitoring solution for Hadoop to instantly identify access to
>sensitive data, recognize attacks and malicious activities, and take actions in
>real time. Eagle supports a wide variety of policies on HDFS data and Hive.
>Eagle also provides machine learning models for detecting anomalous user
>behavior in Hadoop.
>
>The proposal is available on the wiki here:
>https://wiki.apache.org/incubator/EagleProposal
>
>The text of the proposal is also available at the end of this email.
>
>Thanks for your time and help.
>
>Thanks,
>Arun
>
><COPY of the proposal in text format>
>
>Eagle
>
>Abstract
>Eagle is an Open Source Monitoring solution for Hadoop to instantly identify
>access to sensitive data, recognize attacks and malicious activities in Hadoop,
>and take actions.
>
>Proposal
>Eagle audits access to HDFS files, Hive and HBase tables in real time,
>enforces policies defined on sensitive data access and alerts or blocks a user’s
>access to that sensitive data in real time. Eagle also creates user profiles
>based on the typical access behaviour for HDFS and Hive and sends alerts when
>anomalous behaviour is detected. Eagle can also import sensitive data
>information classified by external classification engines to help define its
>policies.
>
>Overview of Eagle
>Eagle has 3 main parts.
>1. Data collection and storage - Eagle collects data from various Hadoop logs
>in real time using Kafka/YARN API and uses HDFS and HBase for storage.
>2. Data processing and policy engine - Eagle allows users to create policies
>based on various metadata properties on HDFS, Hive and HBase data.
>3. Eagle services - Eagle services include policy manager, query service and
>the visualization component.
>Eagle provides an intuitive user interface to
>administer Eagle and an alert dashboard to respond to real time alerts.
>
>Data Collection and Storage:
>Eagle provides a programming API for extending Eagle to integrate any data
>source into the Eagle policy evaluation framework. For example, Eagle HDFS audit
>monitoring collects data from Kafka, which is populated from a NameNode log4j
>appender or from a Logstash agent. Eagle Hive monitoring collects Hive query
>logs from running jobs through the YARN API, which is designed to be scalable and
>fault-tolerant. Eagle uses HBase as storage for metadata and metrics
>data, and also supports relational databases through a configuration change.
>
>Data Processing and Policy Engine:
>Processing Engine: Eagle provides a stream processing API which is an
>abstraction of Apache Storm. It can also be extended to other streaming
>engines. This abstraction allows developers to assemble data transformation,
>filtering, external data join etc. without being physically bound to a specific
>streaming platform. The Eagle streaming API allows developers to easily integrate
>business logic with the Eagle policy engine, and internally the Eagle framework
>compiles the business logic execution DAG into program primitives of the underlying
>stream infrastructure, e.g. Apache Storm. For example, Eagle HDFS monitoring
>transforms audit logs from the NameNode into objects and joins sensitivity metadata
>and security zone metadata, which are generated by external programs or
>configured by the user. Eagle Hive monitoring filters running jobs to get the Hive
>query string, parses the query string into an object and then joins sensitivity
>metadata.
>Alerting Framework: The Eagle Alert Framework includes a stream metadata API,
>a scalable policy engine framework and an extensible policy engine framework.
>The stream
>metadata API allows developers to declare an event schema, including what
>attributes constitute an event, what the type of each attribute is, and how
>to dynamically resolve attribute values at runtime when a user configures a policy.
>The scalable policy engine framework allows policies to be executed on different
>physical nodes in parallel. It also lets developers define their own policy
>partitioner class. The policy engine framework, together with the streaming
>partitioning capability provided by all streaming platforms, makes sure
>policies and events can be evaluated in a fully distributed way. The extensible
>policy engine framework allows developers to plug in a new policy engine with a
>few lines of code. The WSO2 Siddhi CEP engine is the policy engine which Eagle
>supports as a first-class citizen.
>Machine Learning module: Eagle provides capabilities to define user activity
>patterns or user profiles for Hadoop users based on the user behaviour in the
>platform. These user profiles are modeled using Machine Learning algorithms
>and used for detection of anomalous user activities. Eagle uses Eigenvalue
>Decomposition and Density Estimation algorithms for generating user profile
>models. The model reads data from HDFS audit logs, preprocesses and aggregates
>data, and generates models using Spark programming APIs. Once models are
>generated, Eagle uses the stream processing engine for near real-time anomaly
>detection to determine whether any user’s activities are suspicious.
>
>Eagle Services:
>Query Service: Eagle provides a SQL-like service API to support comprehensive
>computation over huge data sets on the fly, e.g. comprehensive filtering,
>aggregation, histogram, sorting, top, arithmetical expression, pagination etc.
>HBase is the data storage which Eagle supports as a first-class citizen;
>relational databases are supported as well.
>For HBase storage, the Eagle query
>framework compiles the user-provided SQL-like query into HBase native filter
>objects and executes it through an HBase coprocessor on the fly.
>Policy Manager: The Eagle policy manager provides a UI and RESTful API for users to
>define policies with just a few clicks. It includes a site management UI, policy
>editor, sensitivity metadata import, HDFS or Hive sensitive resource browsing,
>alert dashboards etc.
>
>Background
>Data is one of the most important assets for today’s businesses, which makes
>data security one of the top priorities of today’s enterprises. Hadoop is
>widely used across different verticals as a big data repository to store this
>data in most modern enterprises.
>At eBay we use the Hadoop platform extensively for our data processing needs. Our
>data in Hadoop is becoming bigger and bigger as our user base is seeing
>exponential growth. Today there is a variety of data sets available in Hadoop
>clusters for our users to consume. eBay has around 120 PB of data stored in
>HDFS across 6 different clusters and around 1800+ active Hadoop users
>consuming data through Hive, HBase and MapReduce jobs every day to build
>applications using this data. With this astronomical growth of data there are
>also challenges in securing sensitive data and monitoring the access to this
>sensitive data. Today in large organizations HDFS is the de facto standard for
>storing big data. Data sets including, but not limited to, consumer
>sentiment, social media data, customer segmentation, web clicks, sensor data,
>geo-location and transaction data get stored in Hadoop for day to day business
>needs.
>We at eBay want to make sure the sensitive data and data platforms are
>completely protected from security breaches.
>So we partnered very closely with
>our Information Security team to understand the requirements for Eagle to
>monitor sensitive data access on Hadoop:
>1. Ability to identify and stop security threats in real time
>2. Scale for big data (support PB scale and billions of events)
>3. Ability to create data access policies
>4. Support multiple data sources like HDFS, HBase, Hive
>5. Visualize alerts in real time
>6. Ability to block malicious access in real time
>We did not find any data access monitoring solution available today that
>can provide the features and functionality that we need to monitor data
>access in the Hadoop ecosystem at our scale. Hence with an excellent team of
>world class developers and several users, we have been able to bring Eagle
>into production as well as open source it.
>
>Rationale
>In today’s world, data is an important asset for any company. Businesses are
>using data extensively to create amazing experiences for users. Data has to be
>protected and access to data should be secured from security breaches. Today
>Hadoop is not only used to store logs but also stores financial data,
>sensitive data sets, geographical data, user click stream data sets etc., which
>makes it more important to protect it from security breaches. To secure a
>data platform there are multiple things that need to happen. One is having a
>strong access control mechanism, which today is provided by Apache Ranger and
>Apache Sentry. These tools provide fine-grained access
>control for data sets on Hadoop. But there is a big gap in terms of
>monitoring all the data access events and activities in order to secure the
>Hadoop data platform. With strong access control, perimeter security
>and data access monitoring in place, data in Hadoop clusters can be secured
>against breaches. We looked around and found the following:
>Existing data activity monitoring products are designed for traditional
>databases and data warehouses.
>Existing monitoring platforms cannot scale out
>to support fast growing data and petabyte scale. The few products in the industry
>are still at a very early stage in supporting HDFS, Hive and HBase data access
>monitoring.
>As mentioned in the background, the business requirement and urgency to secure
>the data from users with malicious intent drove eBay to invest in building a
>real time data access monitoring solution from scratch to offer real time
>alerts and remediation features for malicious data access.
>With the power of open source distributed systems like Hadoop, Kafka and many
>more we were able to develop a data activity monitoring system that can scale,
>identify and stop malicious access in real time.
>Eagle allows admins to create standard access policies and rules for
>monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box machine
>learning models for modeling user profiles based on user access behaviour and
>uses the models to alert on anomalies.
>
>Current Status
>
>Meritocracy
>Eagle has been deployed in production at eBay for monitoring billions of
>events per day from HDFS and Hive operations. From the start, the product has
>been built with high scalability and application extensibility in
>mind, and Eagle has demonstrated great performance in responding to suspicious
>events instantly and great flexibility in defining policy.
>
>Community
>Eagle seeks to develop the developer and user communities during incubation.
>
>Core Developers
>Eagle is currently being designed and developed by engineers from eBay Inc.:
>Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang, Qingwen Zhao,
>Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of these core developers
>have deep expertise in developing monitoring products for the Hadoop ecosystem.
>
>Alignment
>The ASF is a natural host for Eagle given that it is already the home of
>Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data projects.
>Eagle leverages a lot of Apache open-source products. Eagle was designed to
>offer real time insights into sensitive data access by actively monitoring the
>data access on various data sets in Hadoop, and to provide an extensible alerting
>framework with a powerful policy engine. Eagle complements the existing Hadoop
>platform area by providing a comprehensive monitoring and alerting solution
>for detecting sensitive data access threats based on preset policies and
>machine learning models for user behaviour analysis.
>
>Known Risks
>
>Orphaned Products
>The core developers of the Eagle team work full time on this project. There is no
>risk of Eagle getting orphaned since eBay is extensively using it in its
>production Hadoop clusters and has plans to go beyond Hadoop. For example,
>currently there are 7 Hadoop clusters and 2 of them are being monitored using
>Hadoop Eagle in production. We have plans to extend it to all Hadoop clusters
>and eventually other data platforms. There are tens of policies onboarded and
>actively monitored, with plans to onboard more use cases. We are very confident
>that every Hadoop cluster in the world will be monitored using Eagle for
>securing the Hadoop ecosystem by actively monitoring for data access on
>sensitive data. We plan to extend and diversify this community further through
>Apache. We presented Eagle at the Hadoop Summit in China and garnered interest
>from different companies who use Hadoop extensively.
>
>Inexperience with Open Source
>The core developers are all active users and followers of open source. They
>are already committers and contributors to the Eagle GitHub project. All have
>been involved with the source code that has been released under an open source
>license, and several of them also have experience developing code in an open
>source environment. Though the core set of developers does not have Apache open
>source experience, there are plans to onboard individuals with Apache open
>source experience on to the project.
>Apache Kylin PMC members are also in the
>same eBay organization. We work very closely with Apache Ranger committers and
>are looking forward to finding meaningful integrations to improve the security of
>the Hadoop platform.
>
>Homogenous Developers
>The core developers are from eBay. Today the problem of monitoring data
>activities to find and stop threats is a universal problem faced by all
>businesses. The Apache Incubation process encourages an open and diverse
>meritocratic community. Eagle intends to make every possible effort to build a
>diverse, vibrant and involved community and has already received substantial
>interest from various organizations.
>
>Reliance on Salaried Developers
>eBay invested in Eagle as the monitoring solution for Hadoop clusters and some
>of its key engineers are working full time on the project. In addition, since
>there is a growing need for securing sensitive data access and for a data
>activity monitoring solution for Hadoop, we look forward to other Apache
>developers and researchers contributing to the project. Additional
>contributors, including Apache committers, have plans to join this effort
>shortly. Also key to addressing the risk associated with relying on salaried
>developers from a single entity is to increase the diversity of the
>contributors and actively lobby for domain experts in the security space to
>contribute. Eagle intends to do this.
>
>Relationships with Other Apache Products
>Eagle has a strong relationship and dependency with Apache Hadoop, HBase,
>Spark, Kafka and Storm. Being part of Apache’s Incubation community could
>help with a closer collaboration among these projects as well as others.
>
>An Excessive Fascination with the Apache Brand
>Eagle is proposing to enter
>incubation at Apache in order to help efforts to diversify the committer-base,
>not so much to capitalize on the Apache brand.
>The Eagle project is in
>production use already inside eBay, but is not expected to be an eBay product
>for external customers. As such, the Eagle project is not seeking to use the
>Apache brand as a marketing tool.
>
>Documentation
>Information about Eagle can be found at https://github.com/eBay/Eagle. The
>following link provides more information about Eagle:
>http://goeagle.io
>
>Initial Source
>Eagle has been under development since 2014 by a team of engineers at eBay
>Inc. It is currently hosted on GitHub under the Apache License 2.0 at
>https://github.com/eBay/Eagle. Once in incubation we will be moving the code
>base to the Apache git repository.
>
>External Dependencies
>Eagle has the following external dependencies.
>Basic
>•JDK 1.7+
>•Scala 2.10.4
>•Apache Maven
>•JUnit
>•Log4j
>•Slf4j
>•Apache Commons
>•Apache Commons Math3
>•Jackson
>•Siddhi CEP engine
>
>Hadoop
>•Apache Hadoop
>•Apache HBase
>•Apache Hive
>•Apache Zookeeper
>•Apache Curator
>
>Apache Spark
>•Spark Core Library
>
>REST Service
>•Jersey
>
>Query
>•Antlr
>
>Stream processing
>•Apache Storm
>•Apache Kafka
>
>Web
>•AngularJS
>•jQuery
>•Bootstrap V3
>•Moment JS
>•Admin LTE
>•html5shiv
>•respond
>•Fastclick
>•Date Range Picker
>•Flot JS
>
>Cryptography
>Eagle will eventually support encryption on the wire. This is not one of the
>initial goals, and we do not expect Eagle to be a controlled export item due
>to the use of encryption. Eagle supports but does not require the Kerberos
>authentication mechanism to access secured Hadoop services.
>
>Required Resources
>
>Mailing List
>•eagle-private for private PMC discussions
>•eagle-dev for developers
>•eagle-commits for all commits
>•eagle-users for all eagle users
>
>Subversion Directory
>•Git is the preferred source control system.
>
>Issue Tracking
>•JIRA Eagle (Eagle)
>
>Other Resources
>The existing code already has unit tests so we will make use of existing
>Apache continuous testing infrastructure.
>The resulting load should not be
>very large.
>
>Initial Committers
>•Seshu Adunuthula <sadunuthula at ebay dot com>
>•Arun Manoharan <armanoharan at ebay dot com>
>•Edward Zhang <yonzhang at ebay dot com>
>•Hao Chen <hchen9 at ebay dot com>
>•Chaitali Gupta <cgupta at ebay dot com>
>•Libin Sun <libsun at ebay dot com>
>•Jilin Jiang <jiljiang at ebay dot com>
>•Qingwen Zhao <qingwzhao at ebay dot com>
>•Hemanth Dendukuri <hdendukuri at ebay dot com>
>•Senthil Kumar <senthilkumar at ebay dot com>
>
>
>Affiliations
>The initial committers are employees of eBay Inc.
>
>Sponsors
>
>Champion
>•Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>
>Nominated Mentors
>•Owen O’Malley <omalley at apache dot org> - Apache IPMC member, Hortonworks
>•Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>•Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member, Hortonworks
>•Amareshwari Sriramdasu <amareshwari at apache dot org> - Apache IPMC member
>•Taylor Goetz <ptgoetz at apache dot org> - Apache IPMC member, Hortonworks
>
>Sponsoring Entity
>We are requesting the Incubator to sponsor this project.
>