+1 (non-binding)

"Manoharan, Arun" <armanoha...@ebay.com>编写:

>Hello Everyone,
>
>Thanks for all the feedback on the Eagle Proposal.
>
>I would like to call for a [VOTE] on Eagle joining the ASF as an incubation 
>project.
>
>The vote is open for 72 hours:
>
>[ ] +1 accept Eagle in the Incubator
>[ ] ±0
>[ ] -1 (please give reason)
>
>Eagle is a monitoring solution for Hadoop to instantly identify access to
>sensitive data, recognize attacks and malicious activities, and take action in
>real time. Eagle supports a wide variety of policies on HDFS and Hive data.
>Eagle also provides machine learning models for detecting anomalous user
>behavior in Hadoop.
>
>The proposal is available on the wiki here:
>https://wiki.apache.org/incubator/EagleProposal
>
>The text of the proposal is also available at the end of this email.
>
>Thanks for your time and help.
>
>Thanks,
>Arun
>
><COPY of the proposal in text format>
>
>Eagle
>
>Abstract
>Eagle is an open source monitoring solution for Hadoop to instantly identify
>access to sensitive data, recognize attacks and malicious activities in Hadoop,
>and take action.
>
>Proposal
>Eagle audits access to HDFS files, Hive and HBase tables in real time,
>enforces policies defined on sensitive data access, and alerts on or blocks a
>user’s access to that sensitive data in real time. Eagle also creates user
>profiles based on typical access behaviour for HDFS and Hive and sends alerts
>when anomalous behaviour is detected. Eagle can also import sensitive data
>classifications produced by external classification engines to help define its
>policies.
>
>Overview of Eagle
>Eagle has three main parts:
>1. Data collection and storage - Eagle collects data from various Hadoop logs
>in real time using Kafka and the YARN API, and uses HDFS and HBase for storage.
>2. Data processing and policy engine - Eagle allows users to create policies
>based on various metadata properties of HDFS, Hive and HBase data.
>3. Eagle services - Eagle services include the policy manager, query service and
>the visualization component. Eagle provides an intuitive user interface to
>administer Eagle and an alert dashboard to respond to real-time alerts.
>
>Data Collection and Storage:
>Eagle provides a programming API for extending Eagle to integrate any data
>source into the Eagle policy evaluation framework. For example, Eagle HDFS audit
>monitoring collects data from Kafka, which is populated from a NameNode log4j
>appender or from a Logstash agent. Eagle Hive monitoring collects Hive query
>logs from running jobs through the YARN API, which is designed to be scalable
>and fault-tolerant. Eagle uses HBase to store metadata and metric data, and can
>also use a relational database through a configuration change.
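>
>As an illustration only (not Eagle's collector code), here is a minimal sketch
>of tailing NameNode audit log lines from Kafka with the standard Kafka consumer
>API; the broker address, consumer group and topic name are assumptions.
>
>  import java.util.Collections;
>  import java.util.Properties;
>  import org.apache.kafka.clients.consumer.ConsumerRecord;
>  import org.apache.kafka.clients.consumer.ConsumerRecords;
>  import org.apache.kafka.clients.consumer.KafkaConsumer;
>
>  // Hypothetical stand-alone consumer; Eagle's real collector differs.
>  public class HdfsAuditLogConsumer {
>      public static void main(String[] args) {
>          Properties props = new Properties();
>          props.put("bootstrap.servers", "localhost:9092");   // assumed broker
>          props.put("group.id", "eagle-hdfs-audit");          // assumed group id
>          props.put("key.deserializer",
>                    "org.apache.kafka.common.serialization.StringDeserializer");
>          props.put("value.deserializer",
>                    "org.apache.kafka.common.serialization.StringDeserializer");
>          try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
>              // Topic populated by the NameNode log4j appender or a Logstash agent.
>              consumer.subscribe(Collections.singletonList("hdfs_audit_log"));
>              while (true) {
>                  // poll(long) in older clients; newer clients use poll(Duration).
>                  ConsumerRecords<String, String> records = consumer.poll(1000L);
>                  for (ConsumerRecord<String, String> record : records) {
>                      // Each value is one raw audit log line for downstream parsing.
>                      System.out.println(record.value());
>                  }
>              }
>          }
>      }
>  }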
>
>Data Processing and Policy Engine:
>Processing Engine: Eagle provides a stream processing API which is an
>abstraction over Apache Storm. It can also be extended to other streaming
>engines. This abstraction allows developers to assemble data transformation,
>filtering, external data joins etc. without being physically bound to a specific
>streaming platform. The Eagle streaming API allows developers to easily
>integrate business logic with the Eagle policy engine, and internally the Eagle
>framework compiles the business-logic execution DAG into the program primitives
>of the underlying stream infrastructure, e.g. Apache Storm. For example, Eagle
>HDFS monitoring transforms audit logs from the NameNode into objects and joins
>them with sensitivity metadata and security zone metadata, which are generated
>by external programs or configured by the user. Eagle Hive monitoring filters
>running jobs to get the Hive query string, parses the query string into an
>object and then joins it with sensitivity metadata.
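>
>As a rough illustration of the kind of Storm primitives such a DAG compiles
>down to (this is not Eagle's streaming API itself), the sketch below wires a
>spout to a parse bolt and a metadata-join bolt; AuditLogKafkaSpout,
>AuditLogParserBolt and SensitivityJoinBolt are hypothetical placeholder classes.
>
>  import org.apache.storm.Config;
>  import org.apache.storm.StormSubmitter;
>  import org.apache.storm.topology.TopologyBuilder;
>  import org.apache.storm.tuple.Fields;
>
>  public class HdfsAuditMonitorTopology {
>      public static void main(String[] args) throws Exception {
>          TopologyBuilder builder = new TopologyBuilder();
>          // Source: raw audit log lines arriving from Kafka.
>          builder.setSpout("audit-log", new AuditLogKafkaSpout(), 1);
>          // Transform: parse each raw line into a structured audit event.
>          builder.setBolt("parse", new AuditLogParserBolt(), 2)
>                 .shuffleGrouping("audit-log");
>          // External data join: attach sensitivity / security-zone metadata,
>          // grouped by user so one user's events land on the same task.
>          builder.setBolt("join-metadata", new SensitivityJoinBolt(), 2)
>                 .fieldsGrouping("parse", new Fields("user"));
>
>          Config conf = new Config();
>          StormSubmitter.submitTopology("hdfs-audit-monitor", conf,
>                                        builder.createTopology());
>      }
>  }
>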
>Alerting Framework: The Eagle alert framework includes a stream metadata API, a
>scalable policy engine framework and an extensible policy engine framework. The
>stream metadata API allows developers to declare the event schema, including
>which attributes constitute an event, what the type of each attribute is, and
>how to dynamically resolve attribute values at runtime when a user configures a
>policy. The scalable policy engine framework allows policies to be executed on
>different physical nodes in parallel. It also lets users define their own policy
>partitioner class. The policy engine framework, together with the stream
>partitioning capability provided by all streaming platforms, makes sure
>policies and events can be evaluated in a fully distributed way. The extensible
>policy engine framework allows developers to plug in a new policy engine with a
>few lines of code. The WSO2 Siddhi CEP engine is the policy engine which Eagle
>supports as a first-class citizen.
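>
>For illustration only, a minimal sketch of evaluating one policy directly with
>the WSO2 Siddhi CEP engine, outside Eagle's alerting framework; Siddhi 3.x
>package and class names are assumed, and the stream schema and filter are made
>up for this example.
>
>  import org.wso2.siddhi.core.ExecutionPlanRuntime;
>  import org.wso2.siddhi.core.SiddhiManager;
>  import org.wso2.siddhi.core.event.Event;
>  import org.wso2.siddhi.core.query.output.callback.QueryCallback;
>  import org.wso2.siddhi.core.stream.input.InputHandler;
>
>  public class SensitiveAccessPolicy {
>      public static void main(String[] args) throws InterruptedException {
>          SiddhiManager siddhiManager = new SiddhiManager();
>          // Made-up stream schema and policy: alert on every HDFS delete command.
>          String plan =
>              "define stream hdfsAuditStream (user string, cmd string, path string); " +
>              "@info(name = 'deletePolicy') " +
>              "from hdfsAuditStream[cmd == 'delete'] " +
>              "select user, cmd, path insert into alertStream;";
>          ExecutionPlanRuntime runtime = siddhiManager.createExecutionPlanRuntime(plan);
>          runtime.addCallback("deletePolicy", new QueryCallback() {
>              @Override
>              public void receive(long timestamp, Event[] inEvents, Event[] removeEvents) {
>                  // A matching event means the policy fired; raise an alert here.
>                  System.out.println("ALERT: " + inEvents[0]);
>              }
>          });
>          InputHandler input = runtime.getInputHandler("hdfsAuditStream");
>          runtime.start();
>          input.send(new Object[]{"alice", "delete", "/data/secure/pii.csv"});
>          Thread.sleep(500);   // let the asynchronous callback fire
>          runtime.shutdown();
>      }
>  }
>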
>Machine Learning module: Eagle provides capabilities to define user activity
>patterns or user profiles for Hadoop users based on user behaviour on the
>platform. These user profiles are modeled using machine learning algorithms
>and used for detection of anomalous user activities. Eagle uses Eigenvalue
>Decomposition and Density Estimation algorithms for generating user profile
>models. The modeling pipeline reads data from HDFS audit logs, preprocesses and
>aggregates the data, and generates models using the Spark programming APIs.
>Once models are generated, Eagle uses the stream processing engine for
>near-real-time anomaly detection to determine whether a user’s activities are
>suspicious.
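>
>For illustration only (not Eagle's model code), a toy sketch of the aggregation
>step using Spark's Java API, assuming Java 8+: it counts HDFS operations per
>user and command as crude profile features. The input path and audit-log field
>layout are assumptions.
>
>  import org.apache.spark.SparkConf;
>  import org.apache.spark.api.java.JavaPairRDD;
>  import org.apache.spark.api.java.JavaRDD;
>  import org.apache.spark.api.java.JavaSparkContext;
>  import scala.Tuple2;
>
>  public class UserProfileAggregation {
>      public static void main(String[] args) {
>          SparkConf conf = new SparkConf().setAppName("user-profile-aggregation");
>          JavaSparkContext sc = new JavaSparkContext(conf);
>          // Input path is an assumption; audit lines contain fields such as
>          // "ugi=alice ... cmd=open src=/data/secure/file ...".
>          JavaRDD<String> auditLines = sc.textFile("hdfs:///logs/hdfs-audit/*");
>          // Count operations per (user, command) as a crude profile feature.
>          JavaPairRDD<String, Long> counts = auditLines
>              .mapToPair(line -> new Tuple2<>(extractUser(line) + ":" + extractCmd(line), 1L))
>              .reduceByKey(Long::sum);
>          counts.saveAsTextFile("hdfs:///eagle/userprofile/features");
>          sc.stop();
>      }
>
>      // Naive field extraction; a real pipeline would use a proper parser.
>      private static String extractUser(String line) {
>          int i = line.indexOf("ugi=");
>          return i < 0 ? "unknown" : line.substring(i + 4).split("[ \t(]")[0];
>      }
>
>      private static String extractCmd(String line) {
>          int i = line.indexOf("cmd=");
>          return i < 0 ? "unknown" : line.substring(i + 4).split("[ \t]")[0];
>      }
>  }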
>
>Eagle Services:
>Query Service: Eagle provides a SQL-like service API to support comprehensive
>computation over huge data sets on the fly, e.g. filtering, aggregation,
>histograms, sorting, top-N, arithmetic expressions, pagination etc. HBase is
>the data storage which Eagle supports as a first-class citizen; relational
>databases are supported as well. For HBase storage, the Eagle query framework
>compiles a user-provided SQL-like query into HBase native filter objects and
>executes it through an HBase coprocessor on the fly.
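>
>As a rough illustration (not Eagle's actual query compiler) of the native HBase
>filter objects such a SQL-like query might compile into, the sketch below shows
>roughly what "user = 'alice' and cmd = 'delete'" could translate to; the table
>name, column family and qualifiers are invented.
>
>  import org.apache.hadoop.conf.Configuration;
>  import org.apache.hadoop.hbase.HBaseConfiguration;
>  import org.apache.hadoop.hbase.TableName;
>  import org.apache.hadoop.hbase.client.Connection;
>  import org.apache.hadoop.hbase.client.ConnectionFactory;
>  import org.apache.hadoop.hbase.client.Result;
>  import org.apache.hadoop.hbase.client.ResultScanner;
>  import org.apache.hadoop.hbase.client.Scan;
>  import org.apache.hadoop.hbase.client.Table;
>  import org.apache.hadoop.hbase.filter.CompareFilter;
>  import org.apache.hadoop.hbase.filter.FilterList;
>  import org.apache.hadoop.hbase.filter.SingleColumnValueFilter;
>  import org.apache.hadoop.hbase.util.Bytes;
>
>  public class AuditEventFilterScan {
>      public static void main(String[] args) throws Exception {
>          Configuration conf = HBaseConfiguration.create();
>          try (Connection conn = ConnectionFactory.createConnection(conf);
>               Table table = conn.getTable(TableName.valueOf("hdfs_audit_event"))) {
>              // AND of two column-value predicates, like the WHERE clause above.
>              FilterList filters = new FilterList(FilterList.Operator.MUST_PASS_ALL);
>              filters.addFilter(new SingleColumnValueFilter(
>                  Bytes.toBytes("f"), Bytes.toBytes("user"),
>                  CompareFilter.CompareOp.EQUAL, Bytes.toBytes("alice")));
>              filters.addFilter(new SingleColumnValueFilter(
>                  Bytes.toBytes("f"), Bytes.toBytes("cmd"),
>                  CompareFilter.CompareOp.EQUAL, Bytes.toBytes("delete")));
>
>              Scan scan = new Scan();
>              scan.setFilter(filters);
>              try (ResultScanner scanner = table.getScanner(scan)) {
>                  for (Result result : scanner) {
>                      System.out.println(result);   // matching audit rows
>                  }
>              }
>          }
>      }
>  }
>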
>Policy Manager: The Eagle policy manager provides a UI and RESTful API for
>users to define policies with just a few clicks. It includes a site management
>UI, a policy editor, sensitivity metadata import, HDFS and Hive sensitive
>resource browsing, alert dashboards etc.
>
>Background
>Data is one of the most important assets for today’s businesses, which makes
>data security one of the top priorities for today’s enterprises. Hadoop is
>widely used across different verticals as a big data repository to store this
>data in most modern enterprises.
>At eBay we use the Hadoop platform extensively for our data processing needs.
>Our data in Hadoop keeps growing as our user base grows exponentially. Today
>there is a variety of data sets available in the Hadoop clusters for our users
>to consume. eBay has around 120 PB of data stored in HDFS across 6 different
>clusters and around 1,800 active Hadoop users consuming data through Hive,
>HBase and MapReduce jobs every day to build applications using this data. With
>this astronomical growth of data there are also challenges in securing
>sensitive data and monitoring access to that sensitive data. Today in large
>organizations HDFS is the de facto standard for storing big data. Data sets
>including, but not limited to, consumer sentiment, social media data, customer
>segmentation, web clicks, sensor data, geo-location and transaction data get
>stored in Hadoop for day-to-day business needs.
>We at eBay want to make sure the sensitive data and data platforms are
>completely protected from security breaches. So we partnered very closely with
>our Information Security team to understand the requirements for Eagle to
>monitor sensitive data access on Hadoop:
>1. Ability to identify and stop security threats in real time
>2. Scale for big data (support PB scale and billions of events)
>3. Ability to create data access policies
>4. Support for multiple data sources like HDFS, HBase and Hive
>5. Visualize alerts in real time
>6. Ability to block malicious access in real time
>We did not find any data access monitoring solution available today that can
>provide the features and functionality we need to monitor data access in the
>Hadoop ecosystem at our scale. Hence, with an excellent team of world-class
>developers and several users, we have been able to bring Eagle into production
>as well as open source it.
>
>Rationale
>In today’s world, data is an important asset for any company. Businesses are
>using data extensively to create amazing experiences for users. Data has to be
>protected and access to data has to be secured against security breaches. Today
>Hadoop is used not only to store logs but also to store financial data,
>sensitive data sets, geographical data, user click-stream data sets etc., which
>makes it all the more important to protect it from security breaches. To secure
>a data platform, multiple things need to happen. One is having a strong access
>control mechanism, which today is provided by Apache Ranger and Apache Sentry.
>These tools provide fine-grained access control over data sets on Hadoop. But
>there is a big gap in terms of monitoring all the data access events and
>activities in order to secure the Hadoop data platform. With strong access
>control, perimeter security and data access monitoring in place, data in Hadoop
>clusters can be secured against breaches. We looked around and found the
>following:
>Existing data activity monitoring products are designed for traditional
>databases and data warehouses. Existing monitoring platforms cannot scale out
>to support fast-growing data at petabyte scale. The few products in the
>industry that do exist are still very early in supporting HDFS, Hive and HBase
>data access monitoring.
>As mentioned in the background, the business requirement and urgency to secure
>the data from users with malicious intent drove eBay to invest in building a
>real-time data access monitoring solution from scratch that offers real-time
>alerts and remediation features for malicious data access.
>With the power of open source distributed systems like Hadoop, Kafka and many
>others, we were able to develop a data activity monitoring system that can
>scale, and identify and stop malicious access in real time.
>Eagle allows admins to create standard access policies and rules for
>monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box
>machine learning models for modeling user profiles based on user access
>behaviour and uses the models to alert on anomalies.
>
>Current Status
>
>Meritocracy
>Eagle has been deployed in production at eBay for monitoring billions of
>events per day from HDFS and Hive operations. From the start, the product has
>been built with a focus on high scalability and application extensibility in
>mind, and Eagle has demonstrated great performance in responding to suspicious
>events instantly and great flexibility in defining policies.
>
>Community
>Eagle seeks to develop the developer and user communities during incubation.
>
>Core Developers
>Eagle is currently being designed and developed by engineers from eBay Inc. – 
>Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang, Qingwen Zhao, 
>Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of these core developers 
>have deep expertise in developing monitoring products for the Hadoop ecosystem.
>
>Alignment
>The ASF is a natural host for Eagle given that it is already the home of 
>Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data projects. 
>Eagle leverages a number of Apache open-source products. Eagle was designed to
>offer real-time insights into sensitive data access by actively monitoring data
>access on various data sets in Hadoop, backed by an extensible alerting
>framework with a powerful policy engine. Eagle complements the existing Hadoop
>platform area by providing a comprehensive monitoring and alerting solution
>for detecting sensitive data access threats based on preset policies and
>machine learning models for user behaviour analysis.
>
>Known Risks
>
>Orphaned Products
>The core developers of the Eagle team work full time on this project. There is
>no risk of Eagle getting orphaned since eBay is using it extensively in its
>production Hadoop clusters and has plans to go beyond Hadoop. For example,
>there are currently 7 Hadoop clusters, 2 of which are being monitored using
>Eagle in production. We have plans to extend it to all Hadoop clusters and
>eventually other data platforms. There are tens of policies onboarded and
>actively monitored, with plans to onboard more use cases. We are very confident
>that every Hadoop cluster in the world will be monitored using Eagle for
>securing the Hadoop ecosystem by actively monitoring access to sensitive data.
>We plan to extend and diversify this community further through Apache. We
>presented Eagle at Hadoop Summit in China and garnered interest from different
>companies who use Hadoop extensively.
>
>Inexperience with Open Source
>The core developers are all active users and followers of open source. They
>are already committers and contributors to the Eagle GitHub project. All have
>been involved with source code that has been released under an open source
>license, and several of them also have experience developing code in an open
>source environment. Though the core set of developers do not have Apache open
>source experience, there are plans to onboard individuals with Apache open
>source experience onto the project. Apache Kylin PMC members are also in the
>same eBay organization. We work very closely with Apache Ranger committers and
>look forward to finding meaningful integrations to improve the security of the
>Hadoop platform.
>
>Homogenous Developers
>The core developers are from eBay. Today the problem of monitoring data
>activities to find and stop threats is a universal problem faced by all
>businesses. The Apache incubation process encourages an open and diverse
>meritocratic community. Eagle intends to make every possible effort to build a
>diverse, vibrant and involved community and has already received substantial
>interest from various organizations.
>
>Reliance on Salaried Developers
>eBay invested in Eagle as the monitoring solution for its Hadoop clusters and
>some of its key engineers are working full time on the project. In addition,
>since there is a growing need for securing sensitive data access and thus for a
>data activity monitoring solution for Hadoop, we look forward to other Apache
>developers and researchers contributing to the project. Additional
>contributors, including Apache committers, have plans to join this effort
>shortly. Also key to addressing the risk associated with relying on salaried
>developers from a single entity is increasing the diversity of the contributors
>and actively lobbying for domain experts in the security space to contribute.
>Eagle intends to do this.
>
>Relationships with Other Apache Products
>Eagle has a strong relationship with and dependency on Apache Hadoop, HBase,
>Spark, Kafka and Storm. Being part of Apache’s Incubator community could help
>with closer collaboration among these projects as well as others.
>
>An Excessive Fascination with the Apache Brand
>Eagle is proposing to enter incubation at Apache in order to help efforts to
>diversify the committer base, not so much to capitalize on the Apache brand.
>The Eagle project is in production use already inside eBay, but is not expected
>to be an eBay product for external customers. As such, the Eagle project is not
>seeking to use the Apache brand as a marketing tool.
>
>Documentation
>Information about Eagle can be found at https://github.com/eBay/Eagle. The
>following link provides more information about Eagle: http://goeagle.io/
>
>Initial Source
>Eagle has been under development since 2014 by a team of engineers at eBay
>Inc. It is currently hosted on GitHub under the Apache License 2.0 at
>https://github.com/eBay/Eagle. Once in incubation we will be moving the code
>base to an Apache Git repository.
>
>External Dependencies
>Eagle has the following external dependencies.
>Basic
>•JDK 1.7+
>•Scala 2.10.4
>•Apache Maven
>•JUnit
>•Log4j
>•Slf4j
>•Apache Commons
>•Apache Commons Math3
>•Jackson
>•Siddhi CEP engine
>
>Hadoop
>•Apache Hadoop
>•Apache HBase
>•Apache Hive
>•Apache Zookeeper
>•Apache Curator
>
>Apache Spark
>•Spark Core Library
>
>REST Service
>•Jersey
>
>Query
>•Antlr
>
>Stream processing
>•Apache Storm
>•Apache Kafka
>
>Web
>•AngularJS
>•jQuery
>•Bootstrap V3
>•Moment JS
>•Admin LTE
>•html5shiv
>•respond
>•Fastclick
>•Date Range Picker
>•Flot JS
>
>Cryptography
>Eagle will eventually support encryption on the wire. This is not one of the 
>initial goals, and we do not expect Eagle to be a controlled export item due 
>to the use of encryption. Eagle supports but does not require the Kerberos 
>authentication mechanism to access secured Hadoop services.
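>
>For illustration, a minimal sketch of how a client process could log in with
>Kerberos via Hadoop's standard UserGroupInformation API before calling secured
>Hadoop services; the principal and keytab path below are placeholders.
>
>  import org.apache.hadoop.conf.Configuration;
>  import org.apache.hadoop.security.UserGroupInformation;
>
>  public class KerberosLogin {
>      public static void main(String[] args) throws Exception {
>          Configuration conf = new Configuration();
>          // Enable Kerberos authentication for subsequent Hadoop client calls.
>          conf.set("hadoop.security.authentication", "kerberos");
>          UserGroupInformation.setConfiguration(conf);
>          // Placeholder principal and keytab path.
>          UserGroupInformation.loginUserFromKeytab(
>              "eagle@EXAMPLE.COM", "/etc/security/keytabs/eagle.keytab");
>          System.out.println("Logged in as: " + UserGroupInformation.getCurrentUser());
>      }
>  }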
>
>Required Resources
>
>Mailing List
>•eagle-private for private PMC discussions
>•eagle-dev for developers
>•eagle-commits for all commits
>•eagle-users for all eagle users
>
>Subversion Directory
>•Git is the preferred source control system.
>
>Issue Tracking
>•JIRA Eagle (Eagle)
>
>Other Resources
>The existing code already has unit tests so we will make use of existing 
>Apache continuous testing infrastructure. The resulting load should not be 
>very large.
>
>Initial Committers
>•Seshu Adunuthula <sadunuthula at ebay dot com>
>•Arun Manoharan <armanoharan at ebay dot com>
>•Edward Zhang <yonzhang at ebay dot com>
>•Hao Chen <hchen9 at ebay dot com>
>•Chaitali Gupta <cgupta at ebay dot com>
>•Libin Sun <libsun at ebay dot com>
>•Jilin Jiang <jiljiang at ebay dot com>
>•Qingwen Zhao <qingwzhao at ebay dot com>
>•Hemanth Dendukuri <hdendukuri at ebay dot com>
>•Senthil Kumar <senthilkumar at ebay dot com>
>
>
>Affiliations
>The initial committers are employees of eBay Inc.
>
>Sponsors
>
>Champion
>•Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>
>Nominated Mentors
>•Owen O’Malley <omalley at apache dot org> - Apache IPMC member, Hortonworks
>•Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>•Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member, Hortonworks
>•Amareshwari Sriramdasu <amareshwari at apache dot org> - Apache IPMC member
>•Taylor Goetz <ptgoetz at apache dot org> - Apache IPMC member, Hortonworks
>
>Sponsoring Entity
>We are requesting the Incubator to sponsor this project.
>
