+1

-Medha

On 10/23/15, 1:14 PM, "Balaji Ganesan" <bgane...@apache.org> wrote:
>+1
>
>On Fri, Oct 23, 2015 at 12:26 PM, Chris Nauroth <cnaur...@hortonworks.com>
>wrote:
>
>> +1 (binding)
>>
>> --Chris Nauroth
>>
>> On 10/23/15, 7:11 AM, "Manoharan, Arun" <armanoha...@ebay.com> wrote:
>>
>> >Hello Everyone,
>> >
>> >Thanks for all the feedback on the Eagle Proposal.
>> >
>> >I would like to call for a [VOTE] on Eagle joining the ASF as an
>> >incubation project.
>> >
>> >The vote is open for 72 hours:
>> >
>> >[ ] +1 accept Eagle in the Incubator
>> >[ ] ±0
>> >[ ] -1 (please give reason)
>> >
>> >Eagle is a Monitoring solution for Hadoop to instantly identify access to
>> >sensitive data, recognize attacks, malicious activities and take actions
>> >in real time. Eagle supports a wide variety of policies on HDFS data and
>> >Hive. Eagle also provides machine learning models for detecting anomalous
>> >user behavior in Hadoop.
>> >
>> >The proposal is available on the wiki here:
>> >https://wiki.apache.org/incubator/EagleProposal
>> >
>> >The text of the proposal is also available at the end of this email.
>> >
>> >Thanks for your time and help.
>> >
>> >Thanks,
>> >Arun
>> >
>> ><COPY of the proposal in text format>
>> >
>> >Eagle
>> >
>> >Abstract
>> >Eagle is an Open Source Monitoring solution for Hadoop to instantly
>> >identify access to sensitive data, recognize attacks, malicious
>> >activities in Hadoop and take actions.
>> >
>> >Proposal
>> >Eagle audits access to HDFS files, Hive and HBase tables in real time,
>> >enforces policies defined on sensitive data access and alerts or blocks
>> >user's access to that sensitive data in real time. Eagle also creates
>> >user profiles based on the typical access behaviour for HDFS and Hive and
>> >sends alerts when anomalous behaviour is detected. Eagle can also import
>> >sensitive data information classified by external classification engines
>> >to help define its policies.
>> >
>> >Overview of Eagle
>> >Eagle has 3 main parts.
>> >1. Data collection and storage - Eagle collects data from various Hadoop
>> >logs in real time using the Kafka/YARN API and uses HDFS and HBase for
>> >storage.
>> >2. Data processing and policy engine - Eagle allows users to create
>> >policies based on various metadata properties on HDFS, Hive and HBase
>> >data.
>> >3. Eagle services - Eagle services include the policy manager, the query
>> >service and the visualization component. Eagle provides an intuitive user
>> >interface to administer Eagle and an alert dashboard to respond to real
>> >time alerts.
>> >
>> >Data Collection and Storage:
>> >Eagle provides a programming API for extending Eagle to integrate any data
>> >source into the Eagle policy evaluation framework. For example, Eagle HDFS
>> >audit monitoring collects data from Kafka, which is populated from a
>> >NameNode log4j appender or from a Logstash agent. Eagle Hive monitoring
>> >collects Hive query logs from running jobs through the YARN API, which is
>> >designed to be scalable and fault-tolerant. Eagle uses HBase as storage
>> >for metadata and metrics data, and also supports relational databases
>> >through a configuration change.
>> >
>> >Data Processing and Policy Engine:
>> >Processing Engine: Eagle provides a stream processing API which is an
>> >abstraction of Apache Storm. It can also be extended to other streaming
>> >engines. This abstraction allows developers to assemble data
>> >transformation, filtering, external data join etc. without being
>> >physically bound to a specific streaming platform. The Eagle streaming API
>> >allows developers to easily integrate business logic with the Eagle policy
>> >engine, and internally the Eagle framework compiles the business logic
>> >execution DAG into program primitives of the underlying stream
>> >infrastructure, e.g. Apache Storm.
>> >For example, Eagle HDFS monitoring transforms the audit log from the
>> >NameNode into an object and joins it with sensitivity metadata and
>> >security zone metadata, which are generated by external programs or
>> >configured by the user. Eagle Hive monitoring filters running jobs to get
>> >the Hive query string, parses the query string into an object and then
>> >joins it with sensitivity metadata.
>> >Alerting Framework: The Eagle Alert Framework includes a stream metadata
>> >API, a scalable policy engine framework and an extensible policy engine
>> >framework. The stream metadata API allows developers to declare an event
>> >schema, including what attributes constitute an event, what the type of
>> >each attribute is, and how to dynamically resolve attribute values at
>> >runtime when a user configures a policy. The scalable policy engine
>> >framework allows policies to be executed on different physical nodes in
>> >parallel. It also allows developers to define their own policy partitioner
>> >class. The policy engine framework, together with the streaming
>> >partitioning capability provided by all streaming platforms, makes sure
>> >policies and events can be evaluated in a fully distributed way. The
>> >extensible policy engine framework allows developers to plug in a new
>> >policy engine with a few lines of code. The WSO2 Siddhi CEP engine is the
>> >policy engine which Eagle supports as a first-class citizen.
>> >Machine Learning module: Eagle provides capabilities to define user
>> >activity patterns or user profiles for Hadoop users based on user
>> >behaviour in the platform. These user profiles are modeled using Machine
>> >Learning algorithms and used for detection of anomalous user activities.
>> >Eagle uses Eigenvalue Decomposition and Density Estimation algorithms
>> >for generating user profile models. Model generation reads data from HDFS
>> >audit logs, preprocesses and aggregates it, and builds the models using
>> >Spark programming APIs.
>> >Once models are generated, Eagle uses the stream processing engine for
>> >near real-time anomaly detection to determine whether any user's
>> >activities are suspicious.
>> >
>> >Eagle Services:
>> >Query Service: Eagle provides a SQL-like service API to support
>> >comprehensive computation over huge data sets on the fly, e.g.
>> >comprehensive filtering, aggregation, histogram, sorting, top,
>> >arithmetical expression, pagination etc. HBase is the data storage which
>> >Eagle supports as a first-class citizen; relational databases are
>> >supported as well. For HBase storage, the Eagle query framework compiles
>> >the user-provided SQL-like query into HBase native filter objects and
>> >executes it through an HBase coprocessor on the fly.
>> >Policy Manager: The Eagle policy manager provides a UI and a RESTful API
>> >for users to define a policy with just a few clicks. It includes a site
>> >management UI, a policy editor, sensitivity metadata import, HDFS or Hive
>> >sensitive resource browsing, alert dashboards etc.
>> >
>> >Background
>> >Data is one of the most important assets for today's businesses, which
>> >makes data security one of the top priorities of today's enterprises.
>> >Hadoop is widely used across different verticals as a big data repository
>> >to store this data in most modern enterprises.
>> >At eBay we use the Hadoop platform extensively for our data processing
>> >needs. Our data in Hadoop is becoming bigger and bigger as our user base
>> >is seeing exponential growth. Today there are a variety of data sets
>> >available in Hadoop clusters for our users to consume. eBay has around 120
>> >PB of data stored in HDFS across 6 different clusters and more than 1800
>> >active Hadoop users consuming data through Hive, HBase and MapReduce jobs
>> >every day to build applications using this data. With this astronomical
>> >growth of data there are also challenges in securing sensitive data and
>> >monitoring the access to this sensitive data.
>> >Today in large organizations HDFS is the de facto standard for storing
>> >big data. Data sets which include, but are not limited to, consumer
>> >sentiment, social media data, customer segmentation, web clicks, sensor
>> >data, geo-location and transaction data get stored in Hadoop for day to
>> >day business needs.
>> >We at eBay want to make sure the sensitive data and data platforms are
>> >completely protected from security breaches. So we partnered very closely
>> >with our Information Security team to understand the requirements for
>> >Eagle to monitor sensitive data access on Hadoop:
>> >1. Ability to identify and stop security threats in real time
>> >2. Scale for big data (support PB scale and billions of events)
>> >3. Ability to create data access policies
>> >4. Support multiple data sources like HDFS, HBase, Hive
>> >5. Visualize alerts in real time
>> >6. Ability to block malicious access in real time
>> >We did not find any data access monitoring solution available today that
>> >can provide the features and functionality that we need to monitor data
>> >access in the Hadoop ecosystem at our scale. Hence with an excellent team
>> >of world class developers and several users, we have been able to bring
>> >Eagle into production as well as open source it.
>> >
>> >Rationale
>> >In today's world, data is an important asset for any company. Businesses
>> >are using data extensively to create amazing experiences for users. Data
>> >has to be protected and access to data should be secured from security
>> >breaches. Today Hadoop is not only used to store logs but also stores
>> >financial data, sensitive data sets, geographical data, user click stream
>> >data sets etc., which makes it more important to protect it from
>> >security breaches. To secure a data platform there are multiple things
>> >that need to happen. One is having a strong access control mechanism,
>> >which today is provided by Apache Ranger and Apache Sentry.
>> >These tools provide fine-grained access control for data sets on Hadoop.
>> >But there is a big gap in terms of monitoring all the data access events
>> >and activities in order to secure the Hadoop data platform. With strong
>> >access control, perimeter security and data access monitoring all in
>> >place, data in the Hadoop clusters can be secured against breaches. We
>> >looked around and found the following:
>> >Existing data activity monitoring products are designed for traditional
>> >databases and data warehouses. Existing monitoring platforms cannot scale
>> >out to support fast growing data and petabyte scale. The few products in
>> >the industry are still very early in terms of supporting HDFS, Hive and
>> >HBase data access monitoring.
>> >As mentioned in the background, the business requirement and urgency to
>> >secure the data from users with malicious intent drove eBay to invest in
>> >building a real time data access monitoring solution from scratch to
>> >offer real time alerts and remediation features for malicious data access.
>> >With the power of open source distributed systems like Hadoop, Kafka and
>> >many more we were able to develop a data activity monitoring system that
>> >can scale, and identify and stop malicious access in real time.
>> >Eagle allows admins to create standard access policies and rules for
>> >monitoring HDFS, Hive and HBase data. Eagle also provides out-of-the-box
>> >machine learning models for modeling user profiles based on user access
>> >behaviour and uses the models to alert on anomalies.
>> >
>> >Current Status
>> >
>> >Meritocracy
>> >Eagle has been deployed in production at eBay for monitoring billions of
>> >events per day from HDFS and Hive operations.
>> >From the start, the product has been built with a focus on high
>> >scalability and application extensibility, and Eagle has demonstrated
>> >great performance in responding to suspicious events instantly and great
>> >flexibility in defining policy.
>> >
>> >Community
>> >Eagle seeks to develop the developer and user communities during
>> >incubation.
>> >
>> >Core Developers
>> >Eagle is currently being designed and developed by engineers from eBay
>> >Inc.: Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang,
>> >Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of
>> >these core developers have deep expertise in developing monitoring
>> >products for the Hadoop ecosystem.
>> >
>> >Alignment
>> >The ASF is a natural host for Eagle given that it is already the home of
>> >Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data
>> >projects. Eagle leverages a lot of Apache open-source products. Eagle was
>> >designed to offer real time insights into sensitive data access by
>> >actively monitoring the data access on various data sets in Hadoop and an
>> >extensible alerting framework with a powerful policy engine. Eagle
>> >complements the existing Hadoop platform area by providing a
>> >comprehensive monitoring and alerting solution for detecting sensitive
>> >data access threats based on preset policies and machine learning models
>> >for user behaviour analysis.
>> >
>> >Known Risks
>> >
>> >Orphaned Products
>> >The core developers of the Eagle team work full time on this project.
>> >There is no risk of Eagle getting orphaned since eBay is extensively using
>> >it in its production Hadoop clusters and has plans to go beyond Hadoop.
>> >For example, currently there are 7 Hadoop clusters and 2 of them are
>> >being monitored using Hadoop Eagle in production. We have plans to extend
>> >it to all Hadoop clusters and eventually other data platforms.
>> >There are tens of policies onboarded and actively monitored, with plans
>> >to onboard more use cases. We are very confident that every Hadoop
>> >cluster in the world will be monitored using Eagle for securing the Hadoop
>> >ecosystem by actively monitoring for data access on sensitive data. We
>> >plan to extend and diversify this community further through Apache. We
>> >presented Eagle at the Hadoop Summit in China and garnered interest from
>> >different companies who use Hadoop extensively.
>> >
>> >Inexperience with Open Source
>> >The core developers are all active users and followers of open source.
>> >They are already committers and contributors to the Eagle GitHub project.
>> >All have been involved with the source code that has been released under
>> >an open source license, and several of them also have experience
>> >developing code in an open source environment. Though the core set of
>> >developers do not have Apache open source experience, there are plans to
>> >onboard individuals with Apache open source experience on to the project.
>> >Apache Kylin PMC members are also in the same eBay organization. We work
>> >very closely with Apache Ranger committers and are looking forward to
>> >finding meaningful integrations to improve the security of the Hadoop
>> >platform.
>> >
>> >Homogeneous Developers
>> >The core developers are from eBay. Today the problem of monitoring data
>> >activities to find and stop threats is a universal problem faced by all
>> >businesses. The Apache Incubation process encourages an open and diverse
>> >meritocratic community. Eagle intends to make every possible effort to
>> >build a diverse, vibrant and involved community and has already received
>> >substantial interest from various organizations.
>> >
>> >Reliance on Salaried Developers
>> >eBay invested in Eagle as the monitoring solution for Hadoop clusters and
>> >some of its key engineers are working full time on the project.
>> >In addition, since there is a growing need for securing sensitive data
>> >access and for a data activity monitoring solution for Hadoop, we look
>> >forward to other Apache developers and researchers contributing to the
>> >project. Additional contributors, including Apache committers, have plans
>> >to join this effort shortly. Also key to addressing the risk associated
>> >with relying on salaried developers from a single entity is to increase
>> >the diversity of the contributors and actively lobby for domain experts
>> >in the security space to contribute. Eagle intends to do this.
>> >
>> >Relationships with Other Apache Products
>> >Eagle has a strong relationship and dependency with Apache Hadoop, HBase,
>> >Spark, Kafka and Storm. Being part of Apache's Incubation community
>> >could help with closer collaboration among these projects as well as
>> >others.
>> >
>> >An Excessive Fascination with the Apache Brand
>> >Eagle is proposing to enter incubation at Apache in order to help efforts
>> >to diversify the committer-base, not so much to capitalize on the Apache
>> >brand. The Eagle project is in production use already inside eBay, but is
>> >not expected to be an eBay product for external customers. As such, the
>> >Eagle project is not seeking to use the Apache brand as a marketing tool.
>> >
>> >Documentation
>> >Information about Eagle can be found at https://github.com/eBay/Eagle.
>> >The following link provides more information about Eagle:
>> >http://goeagle.io
>> >
>> >Initial Source
>> >Eagle has been under development since 2014 by a team of engineers at
>> >eBay Inc. It is currently hosted on GitHub under the Apache License
>> >2.0 at https://github.com/eBay/Eagle. Once in incubation we will be
>> >moving the code base to the Apache Git repository.
>> >
>> >External Dependencies
>> >Eagle has the following external dependencies.
>> >Basic
>> >• JDK 1.7+
>> >• Scala 2.10.4
>> >• Apache Maven
>> >• JUnit
>> >• Log4j
>> >• Slf4j
>> >• Apache Commons
>> >• Apache Commons Math3
>> >• Jackson
>> >• Siddhi CEP engine
>> >
>> >Hadoop
>> >• Apache Hadoop
>> >• Apache HBase
>> >• Apache Hive
>> >• Apache Zookeeper
>> >• Apache Curator
>> >
>> >Apache Spark
>> >• Spark Core Library
>> >
>> >REST Service
>> >• Jersey
>> >
>> >Query
>> >• Antlr
>> >
>> >Stream processing
>> >• Apache Storm
>> >• Apache Kafka
>> >
>> >Web
>> >• AngularJS
>> >• jQuery
>> >• Bootstrap V3
>> >• Moment JS
>> >• Admin LTE
>> >• html5shiv
>> >• respond
>> >• Fastclick
>> >• Date Range Picker
>> >• Flot JS
>> >
>> >Cryptography
>> >Eagle will eventually support encryption on the wire. This is not one of
>> >the initial goals, and we do not expect Eagle to be a controlled export
>> >item due to the use of encryption. Eagle supports but does not require
>> >the Kerberos authentication mechanism to access secured Hadoop services.
>> >
>> >Required Resources
>> >
>> >Mailing List
>> >• eagle-private for private PMC discussions
>> >• eagle-dev for developers
>> >• eagle-commits for all commits
>> >• eagle-users for all eagle users
>> >
>> >Subversion Directory
>> >• Git is the preferred source control system.
>> >
>> >Issue Tracking
>> >• JIRA Eagle (Eagle)
>> >
>> >Other Resources
>> >The existing code already has unit tests so we will make use of existing
>> >Apache continuous testing infrastructure. The resulting load should not
>> >be very large.
>> >
>> >Initial Committers
>> >• Seshu Adunuthula <sadunuthula at ebay dot com>
>> >• Arun Manoharan <armanoharan at ebay dot com>
>> >• Edward Zhang <yonzhang at ebay dot com>
>> >• Hao Chen <hchen9 at ebay dot com>
>> >• Chaitali Gupta <cgupta at ebay dot com>
>> >• Libin Sun <libsun at ebay dot com>
>> >• Jilin Jiang <jiljiang at ebay dot com>
>> >• Qingwen Zhao <qingwzhao at ebay dot com>
>> >• Hemanth Dendukuri <hdendukuri at ebay dot com>
>> >• Senthil Kumar <senthilkumar at ebay dot com>
>> >
>> >Affiliations
>> >The initial committers are employees of eBay Inc.
>> >
>> >Sponsors
>> >
>> >Champion
>> >• Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>> >
>> >Nominated Mentors
>> >• Owen O'Malley <omalley at apache dot org> - Apache IPMC member,
>> >Hortonworks
>> >• Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
>> >• Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member,
>> >Hortonworks
>> >• Amareshwari Sriramdasu <amareshwari at apache dot org> - Apache IPMC
>> >member
>> >• Taylor Goetz <ptgoetz at apache dot org> - Apache IPMC member,
>> >Hortonworks
>> >
>> >Sponsoring Entity
>> >We are requesting the Incubator to sponsor this project.