Hi Arun,

Eagle sounds very promising. I just had a discussion with someone about this exact need. I do, however, agree with Greg on the name. As far as I can see, besides the name, your weakest point is that the whole team is employed by eBay. It's not a blocker and can be fixed during incubation. Good luck to you.
Alex

On Tue, Oct 20, 2015 at 5:51 PM, Manoharan, Arun <armanoha...@ebay.com> wrote:
> Hi Greg,
>
> Thank you for reviewing the proposal.
>
> Originally we thought Eagle might be trademarked by someone already but I went thru eBay legal team to get the clearance for the name to be used. We will look into it again to see if there will be potential problems.
>
> Thanks,
> Arun
>
> On 10/20/15, 1:52 AM, "Greg Stein" <gst...@gmail.com> wrote:
>
> >Hey there, Arun! ... I have no commentary on the proposal itself, as it looks like a great proposal. I would suggest being a bit wary of the name, as "Eagle" is a *very* popular PCB design program.
> >
> >On Mon, Oct 19, 2015 at 10:33 AM, Manoharan, Arun <armanoha...@ebay.com> wrote:
> >
> >> Hello Everyone,
> >>
> >> My name is Arun Manoharan. Currently a product manager in the Analytics platform team at eBay Inc.
> >>
> >> I would like to start a discussion on Eagle and its joining the ASF as an incubation project.
> >>
> >> Eagle is a Monitoring solution for Hadoop to instantly identify access to sensitive data, recognize attacks, malicious activities and take actions in real time. Eagle supports a wide variety of policies on HDFS data and Hive. Eagle also provides machine learning models for detecting anomalous user behavior in Hadoop.
> >>
> >> The proposal is available on the wiki here:
> >> https://wiki.apache.org/incubator/EagleProposal
> >>
> >> The text of the proposal is also available at the end of this email.
> >>
> >> Thanks for your time and help.
> >>
> >> Thanks,
> >> Arun
> >>
> >> <COPY of the proposal in text format>
> >>
> >> Eagle
> >>
> >> Abstract
> >> Eagle is an Open Source Monitoring solution for Hadoop to instantly identify access to sensitive data, recognize attacks, malicious activities in hadoop and take actions.
> >>
> >> Proposal
> >> Eagle audits access to HDFS files, Hive and HBase tables in real time, enforces policies defined on sensitive data access and alerts on or blocks a user's access to that sensitive data in real time. Eagle also creates user profiles based on the typical access behaviour for HDFS and Hive and sends alerts when anomalous behaviour is detected. Eagle can also import sensitive data information classified by external classification engines to help define its policies.
> >>
> >> Overview of Eagle
> >> Eagle has 3 main parts.
> >> 1. Data collection and storage - Eagle collects data from various Hadoop logs in real time using the Kafka/YARN API and uses HDFS and HBase for storage.
> >> 2. Data processing and policy engine - Eagle allows users to create policies based on various metadata properties on HDFS, Hive and HBase data.
> >> 3. Eagle services - Eagle services include the policy manager, query service and the visualization component. Eagle provides an intuitive user interface to administer Eagle and an alert dashboard to respond to real time alerts.
> >>
> >> Data Collection and Storage:
> >> Eagle provides a programming API for extending Eagle to integrate any data source into the Eagle policy evaluation framework. For example, Eagle HDFS audit monitoring collects data from Kafka, which is populated from the namenode log4j appender or from a logstash agent. Eagle Hive monitoring collects Hive query logs from running jobs through the YARN API, which is designed to be scalable and fault-tolerant. Eagle uses HBase as storage for metadata and metrics data, and also supports relational databases through a configuration change.
> >>
> >> Data Processing and Policy Engine:
> >> Processing Engine: Eagle provides a stream processing API which is an abstraction of Apache Storm. It can also be extended to other streaming engines.
> >> This abstraction allows developers to assemble data transformation, filtering, external data join etc. without being physically bound to a specific streaming platform. The Eagle streaming API allows developers to easily integrate business logic with the Eagle policy engine, and internally the Eagle framework compiles the business logic execution DAG into program primitives of the underlying stream infrastructure, e.g. Apache Storm. For example, Eagle HDFS monitoring transforms the audit log from the Namenode into an object and joins sensitivity metadata and security zone metadata, which are generated from external programs or configured by the user. Eagle Hive monitoring filters running jobs to get the Hive query string, parses the query string into an object and then joins sensitivity metadata.
> >> Alerting Framework: Eagle Alert Framework includes a stream metadata API, a scalable policy engine framework and an extensible policy engine framework. The stream metadata API allows developers to declare an event schema, including what attributes constitute an event, what the type of each attribute is, and how to dynamically resolve an attribute value at runtime when a user configures a policy. The scalable policy engine framework allows policies to be executed on different physical nodes in parallel. It is also used to define your own policy partitioner class. The policy engine framework, together with the streaming partitioning capability provided by all streaming platforms, will make sure policies and events can be evaluated in a fully distributed way. The extensible policy engine framework allows developers to plug in a new policy engine with a few lines of code. WSO2 Siddhi CEP engine is the policy engine which Eagle supports as a first-class citizen.
> >> Machine Learning module: Eagle provides capabilities to define user activity patterns or user profiles for Hadoop users based on the user behaviour in the platform. These user profiles are modeled using Machine Learning algorithms and used for detection of anomalous user activities. Eagle uses Eigen Value Decomposition and Density Estimation algorithms for generating user profile models. The model reads data from HDFS audit logs, preprocesses and aggregates data, and generates models using Spark programming APIs. Once models are generated, Eagle uses the stream processing engine for near real-time anomaly detection to determine whether any user's activities are suspicious.
> >>
> >> Eagle Services:
> >> Query Service: Eagle provides a SQL-like service API to support comprehensive computation for huge sets of data on the fly, e.g. comprehensive filtering, aggregation, histogram, sorting, top, arithmetical expression, pagination etc. HBase is the data storage which Eagle supports as a first-class citizen; relational databases are supported as well. For HBase storage, the Eagle query framework compiles a user-provided SQL-like query into HBase native filter objects and executes it through an HBase coprocessor on the fly.
> >> Policy Manager: Eagle policy manager provides a UI and a RESTful API for users to define a policy with just a few clicks. It includes site management UI, policy editor, sensitivity metadata import, HDFS or Hive sensitive resource browsing, alert dashboards etc.
> >>
> >> Background
> >> Data is one of the most important assets for today's businesses, which makes data security one of the top priorities of today's enterprises. Hadoop is widely used across different verticals as a big data repository to store this data in most modern enterprises.
> >> At eBay we use the Hadoop platform extensively for our data processing needs. Our data in Hadoop is becoming bigger and bigger as our user base is seeing exponential growth. Today there is a variety of data sets available in the Hadoop cluster for our users to consume. eBay has around 120 PB of data stored in HDFS across 6 different clusters and 1800+ active Hadoop users consuming data through Hive, HBase and MapReduce jobs every day to build applications using this data. With this astronomical growth of data there are also challenges in securing sensitive data and monitoring the access to this sensitive data. Today in large organizations HDFS is the de facto standard for storing big data. Data sets including, but not limited to, consumer sentiment, social media data, customer segmentation, web clicks, sensor data, geo-location and transaction data get stored in Hadoop for day to day business needs.
> >> We at eBay want to make sure the sensitive data and data platforms are completely protected from security breaches. So we partnered very closely with our Information Security team to understand the requirements for Eagle to monitor sensitive data access on Hadoop:
> >> 1. Ability to identify and stop security threats in real time
> >> 2. Scale for big data (Support PB scale and Billions of events)
> >> 3. Ability to create data access policies
> >> 4. Support multiple data sources like HDFS, HBase, Hive
> >> 5. Visualize alerts in real time
> >> 6. Ability to block malicious access in real time
> >> We did not find any data access monitoring solution that is available today and can provide the features and functionality that we need to monitor the data access in the Hadoop ecosystem at our scale.
> >> Hence, with an excellent team of world class developers and several users, we have been able to bring Eagle into production as well as open source it.
> >>
> >> Rationale
> >> In today's world, data is an important asset for any company. Businesses are using data extensively to create amazing experiences for users. Data has to be protected and access to data should be secured from security breaches. Today Hadoop is not only used to store logs but also stores financial data, sensitive data sets, geographical data, user click stream data sets etc., which makes it more important to protect it from security breaches. To secure a data platform there are multiple things that need to happen. One is having a strong access control mechanism, which today is provided by Apache Ranger and Apache Sentry. These tools provide fine-grained access control for data sets on Hadoop. But there is a big gap in terms of monitoring all the data access events and activities in order to secure the Hadoop data platform. Together with strong access control, perimeter security and data access monitoring in place, data in the Hadoop clusters can be secured against breaches. We looked around and found the following:
> >> Existing data activity monitoring products are designed for traditional databases and data warehouses. Existing monitoring platforms cannot scale out to support fast growing data and petabyte scale. A few products in the industry are still very early in terms of supporting HDFS, Hive, HBase data access monitoring.
> >> As mentioned in the background, the business requirement and urgency to secure the data from users with malicious intent drove eBay to invest in building a real time data access monitoring solution from scratch to offer real time alerts and remediation features for malicious data access.
> >> With the power of open source distributed systems like Hadoop, Kafka and much more we were able to develop a data activity monitoring system that can scale, identify and stop malicious access in real time.
> >> Eagle allows admins to create standard access policies and rules for monitoring HDFS, Hive and HBase data. Eagle also provides out of box machine learning models for modeling user profiles based on user access behaviour and uses the model to alert on anomalies.
> >>
> >> Current Status
> >>
> >> Meritocracy
> >> Eagle has been deployed in production at eBay for monitoring billions of events per day from HDFS and Hive operations. From the start, the product has been built with a focus on high scalability and application extensibility, and Eagle has demonstrated great performance in responding to suspicious events instantly and great flexibility in defining policy.
> >>
> >> Community
> >> Eagle seeks to develop the developer and user communities during incubation.
> >>
> >> Core Developers
> >> Eagle is currently being designed and developed by engineers from eBay Inc.: Edward Zhang, Hao Chen, Chaitali Gupta, Libin Sun, Jilin Jiang, Qingwen Zhao, Senthil Kumar, Hemanth Dendukuri, Arun Manoharan. All of these core developers have deep expertise in developing monitoring products for the Hadoop ecosystem.
> >>
> >> Alignment
> >> The ASF is a natural host for Eagle given that it is already the home of Hadoop, HBase, Hive, Storm, Kafka, Spark and other emerging big data projects. Eagle leverages a lot of Apache open-source products. Eagle was designed to offer real time insights into sensitive data access by actively monitoring the data access on various data sets in Hadoop and an extensible alerting framework with a powerful policy engine.
> >> Eagle complements the existing Hadoop platform area by providing a comprehensive monitoring and alerting solution for detecting sensitive data access threats based on preset policies and machine learning models for user behaviour analysis.
> >>
> >> Known Risks
> >>
> >> Orphaned Products
> >> The core developers of the Eagle team work full time on this project. There is no risk of Eagle getting orphaned since eBay is extensively using it in its production Hadoop clusters and has plans to go beyond Hadoop. For example, currently there are 7 Hadoop clusters and 2 of them are being monitored using Hadoop Eagle in production. We have plans to extend it to all Hadoop clusters and eventually other data platforms. There are tens of policies onboarded and actively monitored, with plans to onboard more use cases. We are very confident that every Hadoop cluster in the world will be monitored using Eagle for securing the Hadoop ecosystem by actively monitoring for data access on sensitive data. We plan to extend and diversify this community further through Apache. We presented Eagle at the Hadoop Summit in China and garnered interest from different companies who use Hadoop extensively.
> >>
> >> Inexperience with Open Source
> >> The core developers are all active users and followers of open source. They are already committers and contributors to the Eagle Github project. All have been involved with the source code that has been released under an open source license, and several of them also have experience developing code in an open source environment. Though the core set of developers do not have Apache open source experience, there are plans to onboard individuals with Apache open source experience on to the project. Apache Kylin PMC members are also in the same eBay organization.
> >> We work very closely with Apache Ranger committers and are looking forward to finding meaningful integrations to improve the security of the Hadoop platform.
> >>
> >> Homogenous Developers
> >> The core developers are from eBay. Today the problem of monitoring data activities to find and stop threats is a universal problem faced by all businesses. The Apache Incubation process encourages an open and diverse meritocratic community. Eagle intends to make every possible effort to build a diverse, vibrant and involved community and has already received substantial interest from various organizations.
> >>
> >> Reliance on Salaried Developers
> >> eBay invested in Eagle as the monitoring solution for Hadoop clusters and some of its key engineers are working full time on the project. In addition, since there is a growing need to secure sensitive data access and for a data activity monitoring solution for Hadoop, we look forward to other Apache developers and researchers contributing to the project. Additional contributors, including Apache committers, have plans to join this effort shortly. Also key to addressing the risk associated with relying on salaried developers from a single entity is to increase the diversity of the contributors and actively lobby for domain experts in the security space to contribute. Eagle intends to do this.
> >>
> >> Relationships with Other Apache Products
> >> Eagle has a strong relationship and dependency with Apache Hadoop, HBase, Spark, Kafka and Storm. Being part of Apache's Incubation community could help with closer collaboration among these projects as well as others.
> >>
> >> An Excessive Fascination with the Apache Brand
> >> Eagle is proposing to enter incubation at Apache in order to help efforts to diversify the committer-base, not so much to capitalize on the Apache brand.
> >> The Eagle project is in production use already inside eBay, but is not expected to be an eBay product for external customers. As such, the Eagle project is not seeking to use the Apache brand as a marketing tool.
> >>
> >> Documentation
> >> Information about Eagle can be found at https://github.com/eBay/Eagle.
> >> The following link provides more information about Eagle: http://goeagle.io.
> >>
> >> Initial Source
> >> Eagle has been under development since 2014 by a team of engineers at eBay Inc. It is currently hosted on Github.com under the Apache License 2.0 at https://github.com/eBay/Eagle. Once in incubation we will be moving the code base to the Apache git repository.
> >>
> >> External Dependencies
> >> Eagle has the following external dependencies.
> >>
> >> Basic
> >> - JDK 1.7+
> >> - Scala 2.10.4
> >> - Apache Maven
> >> - JUnit
> >> - Log4j
> >> - Slf4j
> >> - Apache Commons
> >> - Apache Commons Math3
> >> - Jackson
> >> - Siddhi CEP engine
> >>
> >> Hadoop
> >> - Apache Hadoop
> >> - Apache HBase
> >> - Apache Hive
> >> - Apache Zookeeper
> >> - Apache Curator
> >>
> >> Apache Spark
> >> - Spark Core Library
> >>
> >> REST Service
> >> - Jersey
> >>
> >> Query
> >> - Antlr
> >>
> >> Stream processing
> >> - Apache Storm
> >> - Apache Kafka
> >>
> >> Web
> >> - AngularJS
> >> - jQuery
> >> - Bootstrap V3
> >> - Moment JS
> >> - Admin LTE
> >> - html5shiv
> >> - respond
> >> - Fastclick
> >> - Date Range Picker
> >> - Flot JS
> >>
> >> Cryptography
> >> Eagle will eventually support encryption on the wire. This is not one of the initial goals, and we do not expect Eagle to be a controlled export item due to the use of encryption. Eagle supports but does not require the Kerberos authentication mechanism to access secured Hadoop services.
> >>
> >> Required Resources
> >>
> >> Mailing List
> >> - eagle-private for private PMC discussions
> >> - eagle-dev for developers
> >> - eagle-commits for all commits
> >> - eagle-users for all eagle users
> >>
> >> Subversion Directory
> >> - Git is the preferred source control system.
> >>
> >> Issue Tracking
> >> - JIRA Eagle (Eagle)
> >>
> >> Other Resources
> >> The existing code already has unit tests so we will make use of existing Apache continuous testing infrastructure. The resulting load should not be very large.
> >>
> >> Initial Committers
> >> - Seshu Adunuthula <sadunuthula at ebay dot com>
> >> - Arun Manoharan <armanoharan at ebay dot com>
> >> - Edward Zhang <yonzhang at ebay dot com>
> >> - Hao Chen <hchen9 at ebay dot com>
> >> - Chaitali Gupta <cgupta at ebay dot com>
> >> - Libin Sun <libsun at ebay dot com>
> >> - Jilin Jiang <jiljiang at ebay dot com>
> >> - Qingwen Zhao <qingwzhao at ebay dot com>
> >> - Hemanth Dendukuri <hdendukuri at ebay dot com>
> >> - Senthil Kumar <senthilkumar at ebay dot com>
> >> - Tan Chen <tanchen at ebay dot com>
> >>
> >> Affiliations
> >> The initial committers are employees of eBay Inc.
> >>
> >> Sponsors
> >>
> >> Champion
> >> - Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
> >>
> >> Nominated Mentors
> >> - Owen O'Malley <omalley at apache dot org> - Apache IPMC member, Hortonworks
> >> - Henry Saputra <hsaputra at apache dot org> - Apache IPMC member
> >> - Julian Hyde <jhyde at hortonworks dot com> - Apache IPMC member, Hortonworks
> >>
> >> Sponsoring Entity
> >> We are requesting the Incubator to sponsor this project.
> >>
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>

--
Best Regards,
-- Alex