Re: DataSketches Proposal

Kenneth Knowles Sun, 24 Feb 2019 19:34:10 -0800

Can you share the Google doc with the proposal? Per Ted's advice, we can
iterate quickly there and move it to the wiki when it becomes a bit more
stable.


Kenn

On Fri, Feb 22, 2019 at 10:21 PM lee...@gmail.com <lee...@gmail.com> wrote:

> Thanks for the offer.  I am a neophyte at this process and email app!   I
> could use a lot of help getting this off the ground!  Also, I'm not sure
> that Mr. Chen and Mr. Onofré have fully accepted taking this on :)
>
> Lee.
>
> On 2019/02/23 06:03:58, Kenneth Knowles <k...@apache.org> wrote:
> > Nice.
> >
> > I would very much like to help mentor this project, though you already
> > have a couple of good ones.
> >
> > I concur with incubator as sponsoring entity.
> >
> > Kenn (VP Apache Beam)
> >
> > On Fri, Feb 22, 2019 at 9:45 PM leerho <lee...@gmail.com> wrote:
> >
> > > I didn't realize that this mail list does not accept PDF files,
> apparently
> > > only text.  So let me try one more time ... :)  Please let me know if
> > > this works!
> > >
> > >
> > > = Apache DataSketches Proposal[1] =
> > >
> > > == Abstract ==
> > >
> > > DataSketches.GitHub.io is an open source, high-performance library of
> > > stochastic streaming algorithms commonly called "sketches" in the data
> > > sciences. Sketches are small, stateful programs that process massive data
> > > as a stream and can provide approximate answers, with mathematical
> > > guarantees, to computationally difficult queries orders of magnitude faster
> > > than traditional, exact methods.
> > >
> > > This proposal is to move DataSketches to the Apache Software
> > > Foundation (ASF), transferring ownership of its copyright intellectual
> > > property to the ASF.  Thereafter, DataSketches would be officially
> known as
> > > Apache DataSketches and its evolution and governance would come under
> the
> > > rules and guidance of the ASF.
> > >
> > > == Introduction ==
> > >
> > > The DataSketches library contains carefully crafted implementations of
> > > sketch algorithms that meet rigorous standards of quality and
> performance
> > > and provide capabilities required for large-scale production systems
> that
> > > must process and analyze massive data. The DataSketches core
> repository is
> > > written in Java with a parallel core repository written in C++ that
> > > includes Python wrappers. The DataSketches library also includes
> special
> > > repositories for extending the core library for Apache Hive and Apache
> Pig.
> > > The sketches developed in the different languages share a common binary
> > > storage format so that sketches created and stored in Java, for example,
> > > can be fully used in C++, and vice versa.  Because the stored sketch
> > > "images" are just a "blob" of bytes (similar to picture images), they can
> > > be shared across many different systems, languages and platforms.
> > >
> > > The DataSketches documentation website, https://datasketches.github.io,
> > > includes general tutorials, a comprehensive research section with
> > > references to relevant academic papers, extensive examples for using
> the
> > > core library directly as well as examples for accessing the library in
> > > Hive, Pig, and Apache Spark.
> > >
> > > The DataSketches library also includes a characterization repository
> for
> > > long running test programs that are used for studying accuracy and
> > > performance of these sketches over wide ranges of input variables. The
> data
> > > produced by these programs is used for generating the many performance
> > > plots contained in the documentation website and for academic
> > > publications.
> > >
> > > The code repositories used for production are versioned and published to
> > > Maven Central at periodic intervals as the library evolves.
> > >
> > > The DataSketches library also includes several experimental
> repositories
> > > for use-cases outside the large-scale systems environments, such as
> > > sketches for mobile, IoT devices (Android), command-line access of the
> > > sketch library, and an experimental repository for vector-based
> sketches
> > > that performs approximate Singular Value Decomposition (SVD) analysis
> that
> > > could potentially be used in Machine Learning (ML) applications.
> > >
> > > == Background ==
> > >
> > > The DataSketches library was started in 2012 as an internal Yahoo project
> > > to dramatically reduce the time and resources required for distinct (unique)
> > > counting.  An extensive search on the Internet at the time yielded a
> number
> > > of theoretical papers on stochastic streaming algorithms with
> pseudocode
> > > examples, but we did not find any usable open-source code of the
> quality we
> > > felt we needed for our internal production systems.  So we started a
> small
> > > project (one person) to develop our own sketches working directly from
> > > published theoretical papers.
> > >
> > > The DataSketches library was designed from the start with the
> objective of
> > > making these algorithms, usually only described in theoretical papers,
> > > easily accessible to systems developers for use in our internal
> production
> > > systems. By necessity, the code had to be of the highest quality and
> > > thoroughly tested. The wide variety of our internal production systems
> > > drove the requirement that the sketch implementations had to have an
> > > absolute minimum of external, run-time dependencies in order to
> simplify
> > > integration and troubleshooting.
> > >
> > > Our internal experiments demonstrated dramatic positive impact on the
> > > performance of our systems.  As a result, the DataSketches library
> quickly
> > > evolved to include different types of sketches for different types of
> > > queries, such as frequent-items (a.k.a. heavy-hitters) algorithms,
> > > quantile/histogram algorithms, and weighted and unweighted sampling
> > > algorithms.
> > >
> > > We quickly discovered that developing these sketch algorithms to be
> truly
> > > robust in production environments is quite difficult and requires deep
> > > understanding of the underlying mathematics and statistics as well as
> > > extensive experience in developing high quality code for 24/7
> production
> > > systems. This is a difficult combination of skills for any one
> organization
> > > to collect and maintain over time. It became clear that this technology
> > > needed a community larger than Yahoo to evolve.  In November, 2015,
> this
> > > factor, along with Yahoo’s strong experience and support of open
> source,
> > > led to the decision to open source this technology under an Apache 2.0
> > > license on GitHub. Since that time our community has expanded considerably
> > > and the key contributors to this effort include leading research
> > > scientists from a number of universities as well as practitioners and
> > > researchers from a number of major corporations. The core of this
> group is
> > > very active as we meet weekly to discuss research directions and
> > > engineering priorities.
> > >
> > > It is important to note that our internal systems at Yahoo use the
> current
> > > public GitHub open source DataSketches library and not an internal
> version
> > > of the code.
> > >
> > > The close collaboration of scientific research and engineering
> development
> > > experience with actual massive-data processing systems has also
> produced
> > > new research publications in the field of stochastic streaming
> algorithms,
> > > for example:
> > >
> > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty, Lee
> Rhodes, and
> > > Justin Thaler. A high-performance algorithm for identifying frequent
> items
> > > in data streams. In ACM IMC 2017.
> > >
> > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin Thaler. A
> > > framework for estimating stream expression cardinalities. In *EDBT/ICDT
> > > Proceedings ‘16*, pages 6:1–6:17, 2016.
> > >
> > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient Frequent
> > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD Proceedings
> ‘16,
> > > pages 845-854, 2016.
> > >
> > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal quantile
> > > approximation in streams. In IEEE FOCS Proceedings ‘16, pages 71–78,
> 2016.
> > >
> > > * Kevin J Lang. Back to the future: an even more nearly optimal
> cardinality
> > > estimation algorithm. arXiv preprint https://arxiv.org/abs/1708.06839,
> > > 2017.
> > >
> > > * Edo Liberty. Simple and deterministic matrix sketching. In ACM KDD
> > > Proceedings ‘13, pages 581–588, 2013.
> > >
> > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and Jonathan
> Ullman.
> > > Space lower bounds for itemset frequency sketches. In ACM PODS
> Proceedings
> > > ‘16, pages 441–454, 2016.
> > >
> > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler. Hierarchical
> > > heavy hitters with the space saving algorithm. In SIAM ALENEX
> Proceedings
> > > ‘12, pages 160–174, 2012.
> > >
> > > == The Rationale for Sketches ==
> > >
> > > In the analysis of big data there are often problem queries that don’t
> > > scale because they require huge compute resources and time to generate
> > > exact results. Examples include count distinct, quantiles, most
> frequent
> > > items, joins, matrix computations, and graph analysis.
> > >
> > > If we can loosen the requirement of “exact” results from our queries
> and be
> > > satisfied with approximate results, within some well understood bounds
> of
> > > error, there is an entire branch of mathematics and data science that
> has
> > > evolved around developing algorithms that can produce approximate
> results
> > > with mathematically well-defined error properties.
> > >
> > > Adding the requirements that these algorithms be small (compared to the
> > > size of the input data), sublinear (the size of the sketch must grow at a
> > > slower rate than the size of the input stream), streaming (they can touch
> > > each data item only once), and mergeable (suitable for distributed
> > > processing) defines a class of algorithms that can be described as small,
> > > stochastic, streaming, sublinear, mergeable algorithms, commonly called
> > > sketches (they also have other names, but we will use the term sketches
> > > from here on).
> > >
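As a rough illustration of the four properties listed above (small, sublinear, streaming, mergeable), the following toy K-Minimum-Values distinct-count sketch may help. It is a sketch under stated assumptions, not the DataSketches implementation: the class name `KMVSketch`, the parameter `k`, and the choice of SHA-256 as the hash function are all hypothetical choices made for this example.

```python
import hashlib


class KMVSketch:
    """Toy K-Minimum-Values distinct-count sketch (illustrative only)."""

    def __init__(self, k=256):
        self.k = k          # fixed capacity: the sketch never holds more than k values
        self.mins = set()   # the k smallest hash values seen so far, each in [0, 1]

    @staticmethod
    def _hash(item):
        # Assumption for this example: SHA-256 of repr(item), mapped to a float.
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def _offer(self, h):
        if h in self.mins:
            return                      # duplicates never change the state
        if len(self.mins) < self.k:
            self.mins.add(h)
            return
        largest = max(self.mins)        # O(k) scan; acceptable for a toy
        if h < largest:
            self.mins.remove(largest)
            self.mins.add(h)

    def update(self, item):
        """Streaming: each item is examined exactly once."""
        self._offer(self._hash(item))

    def merge(self, other):
        """Mergeable: fold in another sketch built on a different partition."""
        for h in other.mins:
            self._offer(h)

    def estimate(self):
        if len(self.mins) < self.k:
            return float(len(self.mins))    # exact while under capacity
        return (self.k - 1) / max(self.mins)
```

Two sketches built on separate partitions of a stream and then merged end up in the same state as one sketch fed the whole stream, which is what makes this kind of structure suitable for distributed processing. With `k = 256` the relative standard error of the estimate is roughly 6%, regardless of stream length.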
> > > To be truly streaming and be able to process data in a single pass,
> > > sketches must make absolute minimum assumptions about the input stream.
> > > This is critically important, as there is no “second chance” to
> process the
> > > data.
> > >
> > > For example, sketches should not make assumptions about the order of
> stream
> > > items, the stream length, the dynamic range of values, or the
> distribution
> > > of item occurrence frequencies. Sketches should be tolerant of NaNs,
> Nulls
> > > and empty objects. About the only thing that the sketch needs to know
> about
> > > the stream is how to extract items from it and what type the item is,
> e.g.,
> > > is it a numeric value or a string.
> > >
> > > As far as the sketch is concerned, the input stream is a sequence of
> items
> > > in some unknown random order with unknown random values.
> > >
> > > The sketch is essentially a complex state machine and, combined with the
> > > random input stream, defines a stochastic process. We then apply
> > > probabilistic methods to interpret the states of the stochastic
> process in
> > > order to extract useful information about the input stream itself. The
> > > resulting information will be approximate, but we also use additional
> > > probabilistic methods to extract an estimate of the likely probability
> > > distribution of error.
> > >
> > > There is a significant scientific contribution here: defining the state
> > > machine, understanding the resulting stochastic process, developing the
> > > probabilistic methods, and proving mathematically that it all works!
> > > This is why the scientific contributors to this project are a critical
> and
> > > strategic component to our success.  The development engineers
> translate
> > > the concepts of the proposed state machine and probabilistic methods
> into
> > > production-quality code. Even more important, they work closely with
> the
> > > scientists, feeding back system and user requirements, which leads not
> only
> > > to superior product design, but to new science as well.  A number of the
> > > scientific papers our members have published (see above) are a direct
> > > result of this close collaboration.
> > >
> > > Because sketches are small they can be processed extremely fast, often
> many
> > > orders-of-magnitude faster than traditional exact computations. For
> > > interactive queries there may not be other viable alternatives, and in
> the
> > > case of real-time analysis, sketches are the only known solution.
> > >
> > > For any system that needs to extract useful information from massive data,
> > > sketches are essential tools that should be tightly integrated into the
> > > system’s analysis capabilities. This technology has helped Yahoo
> > > successfully reduce data processing times from days to hours or
> minutes on
> > > a number of its internal platforms and has enabled subsecond queries on
> > > real-time platforms that would have been infeasible without sketches.
> > >
> > > == The Rationale for Apache DataSketches ==
> > >
> > > Other open source implementations of sketch algorithms can be found on
> the
> > > Internet. However, we have not yet found any open source
> implementations
> > > that are as comprehensive, engineered with the quality required for
> > > production systems, and with usable and guaranteed error properties.
> > > Large Internet companies, such as Google and Facebook, have published
> > > papers on sketching; however, their implementations of their published
> > > algorithms are proprietary and not available as open source.
> > >
> > > The DataSketches library already provides integrations with a number of
> > > major Apache data processing platforms such as Apache Hive, Apache Pig,
> > > Apache Spark and Apache Druid, and is also integrated with a number of
> > > other open source data processing platforms such as Splice Machine,
> GCHQ
> > > Gaffer and PostgreSQL.
> > >
> > > We believe that DataSketches, as an Apache project, will provide an
> > > immediate, worthwhile, and substantial contribution to the open source
> > > community, will have a better opportunity to contribute meaningfully to
> > > both the science and engineering of sketching algorithms, and will
> > > integrate with other Apache projects.  In addition, this is a
> > > significant opportunity for Apache to be the "go-to" destination for
> users
> > > that want to leverage this exciting technology.
> > >
> > > == Initial Goals ==
> > >
> > > We are breaking our initial goals into short-term (2-6 months) and
> > > intermediate to long-term (6 months to 2 years):
> > >
> > > Our short-term goals include:
> > >
> > > * Understanding and adapting to the Apache development process and
> > > structures.
> > >
> > > * Refactoring the codebase and moving the code of the various
> > > DataSketches repositories to an Apache Git repository.
> > >
> > > * Continuing development of new features, functions, and fixes.
> > >
> > > * Specific sub-projects (e.g., C++ and Python) will continue to be
> > > developed and expanded.
> > >
> > >
> > > The intermediate to long term goals include:
> > >
> > > * Completing the design and implementation of the C++ sketches to
> > > complement what is already available in Java, and the Python wrappers
> of
> > > those C++ sketches.
> > >
> > > * Expanding the C++ build framework to include Windows and the popular
> > > Linux variants.
> > >
> > > * Continued engagement with the scientific research community on the
> > > development of new algorithms for computationally difficult problems
> that
> > > heretofore have not had a sketching solution.
> > >
> > > == Current Status ==
> > >
> > > The DataSketches GitHub project has been quite successful.  As of this
> > > writing (February 2019), the number of downloads measured by the Nexus
> > > Repository Manager at https://oss.sonatype.org has grown by nearly a
> > > factor of 10 over the past year to about 55 thousand per month. The
> > > DataSketches/sketches-core repository has about 560 stars and 141 forks,
> > > which is pretty good for a highly specialized library.
> > >
> > > === Development Practices ===
> > >
> > > ==== Source Control ====
> > >
> > > All of our developers have extensive experience with Git version
> control
> > > and follow accepted practices for use of Pull Requests (PRs), code
> reviews
> > > and commits to master, for example.
> > >
> > > ==== Testing ====
> > >
> > > Sketches, by their nature, are probabilistic programs and don’t necessarily
> > > behave deterministically.  For some of the sketches we intentionally insert
> > > random noise into the code as this gives us the mathematical properties
> > > that we need to guarantee accuracy.  This can make the behavior of
> these
> > > algorithms quite unintuitive and provides significant challenges to the
> > > developer who wishes to test these algorithms for correctness. As a
> result,
> > > our testing strategy includes two major components: unit tests, and
> > > characterization tests.
> > >
> > > ===== Unit Testing =====
> > >
> > > Our unit tests are primarily quick tests to make sure that we exercise
> all
> > > critical paths in the code and that key branches are executed
> correctly. It
> > > is important that they execute relatively fast as they are generally
> run on
> > > every code build. The sketches-core repository alone has about 22
> thousand
> > > statements, over 1300 unit tests and code coverage of about 98.2% as
> > > measured by Atlassian/Clover.  Our goal is for all code repositories
> > > used in production to have code coverage greater than 90%.
> > >
> > > ===== Characterization Testing =====
> > >
> > > In order to test the probabilistic methods that are used to interpret
> the
> > > stochastic behaviors of our sketches we have a separate
> characterization
> > > repository that is dedicated to this.  Measuring accuracy, for example,
> > > requires running thousands of trials at each of many different points along
> > > the domain axis. Each trial compares its estimated results against a
> known
> > > exact result producing an error for that trial.  These error
> measurements
> > > are then fed into our Quantiles sketch to capture the actual
> distribution
> > > of error at that point along the axis. We then select quantile contours
> > > across all the distributions at points along the axis.  These contours
> can
> > > then be plotted to reveal the shape of the actual error distribution.
> > > These distributions are not at all Gaussian; in fact, they can be quite
> > > complex. Nonetheless, these distributions are then checked against the
> > > statistical guarantees inherent to the specific sketch algorithm and its
> > > parameters.
> > > There are many examples of these characterization error distributions
> on
> > > our website. The runtimes of these tests can be very long and can range
> > > from many minutes to hours, and some can run for days.  Currently, we
> have
> > > separate characterization repositories for Java and C++ / Python.
> > >
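The characterization procedure described above can be approximated in a few lines. This is only a sketch under assumptions: uniform random numbers stand in for hashed unique items, a toy K-Minimum-Values estimator stands in for a real sketch, and exact quantiles from Python's `statistics.quantiles` stand in for the Quantiles sketch the proposal mentions. The names `kmv_estimate` and `characterize` are hypothetical.

```python
import heapq
import random
import statistics


def kmv_estimate(hashes, k):
    """Distinct-count estimate from the k smallest of a list of uniform hashes."""
    smallest = heapq.nsmallest(k, hashes)
    if len(smallest) < k:
        return float(len(smallest))     # fewer than k items: the count is exact
    return (k - 1) / smallest[-1]       # smallest[-1] is the k-th smallest hash


def characterize(stream_lengths, trials=200, k=64, seed=1):
    """For each stream length, run many trials and summarize the relative error."""
    rng = random.Random(seed)
    contours = {}
    for n in stream_lengths:
        errors = []
        for _ in range(trials):
            # Each trial: n fresh "unique items", represented by their hashes.
            hashes = [rng.random() for _ in range(n)]
            # Relative error of the estimate against the known exact answer n.
            errors.append(kmv_estimate(hashes, k) / n - 1.0)
        # 19 cut points at 5%, 10%, ..., 95% of the error distribution.
        cuts = statistics.quantiles(errors, n=20)
        contours[n] = {"q05": cuts[0], "q50": cuts[9], "q95": cuts[18]}
    return contours
```

Calling `characterize([1000, 4000, 16000])` returns one set of quantile contours per stream length; plotting `q05`, `q50`, and `q95` against the stream length gives a small-scale version of the error plots described above. Real characterization runs use far more trials and points, which is why they can take hours or days.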
> > > It is our goal that we perform this characterization analysis for all
> of
> > > our sketches.  By definition, the code that runs these characterization
> > > tests is open-source so others can run these tests as well.  We do not
> have
> > > formal releases of this code (because it is not production code) and
> it is
> > > not published to Maven Central.
> > >
> > > === Meritocracy ===
> > >
> > > DataSketches was initially developed based on requirements within
> Yahoo. As
> > > a project on GitHub, DataSketches has received contributions from
> numerous
> > > individual developers from around the world, dedicated research work
> from
> > > senior scientists at Amazon and Visa, and academic researchers from
> > > Georgetown University, Princeton, and MIT.
> > >
> > > As a project under incubation, we are committed to expanding our
> effort to
> > > build an environment which supports a meritocracy. We are focused on
> > > engaging the community and other related projects for support and
> > > contributions. Moreover, we are committed to ensuring that contributors
> > > and committers to DataSketches come from a broad mix of organizations
> > > through a merit-based decision process during incubation. We believe
> > > strongly in the DataSketches premise of a well-engineered and
> > > scientifically rigorous library that implements these powerful algorithms,
> > > and we are committed to growing an inclusive community of DataSketches
> > > contributors and users.
> > >
> > > === Community ===
> > >
> > > Yahoo has a long history and active engagement in the Open Source
> > > community. Major projects include: Vespa.ai, Bullet, Moloch, Panoptes,
> > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel, TensorFlowOnSpark,
> gifshot,
> > > fluxible, as well as the creation, contribution and incubation of many
> > > Apache projects such as Apache Hadoop, Pig, Bookkeeper, Oozie,
> Zookeeper,
> > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> > >
> > > Every day, DataSketches is actively used by organizations and
> > > institutions around the world for batch and stream processing of data.
> We
> > > believe acceptance will allow us to consolidate existing
> > > DataSketches-related work, grow the DataSketches community, and deepen
> > > connections between DataSketches and other open source projects.
> > >
> > > === Introduction to the Core Developers & Contributors ===
> > >
> > > The core developers and contributors for DataSketches are from diverse
> > > backgrounds, but primarily are scientists that love engineering and
> > > engineers that love science. A large part of the value we bring comes
> from
> > > this synthesis.  These individuals have already contributed
> substantially
> > > to the code, algorithms, and/or mathematical proofs that form the
> basis of
> > > the library.
> > >
> > > This core group also forms the Initial Committers with write permissions to
> > > the repository. Those marked with (*) meet weekly to plan the research and
> > > engineering direction of the project.
> > >
> > > ==== Scientists That Love Engineering ====
> > >
> > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs, Israel.
> Interests:
> > > distributed systems, scalable systems and platforms for big data
> > > processing, concurrent algorithms and data structures.
> > >
> > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo Labs,
> Sunnyvale,
> > > California. Interests: algorithms, theoretical and applied mathematics,
> > > encoding and compression theory, theoretical and applied performance
> > > optimization.
> > >
> > > * Edo Liberty: (*) Director of Research, Head of Amazon AI Labs, Palo
> Alto,
> > > California. Manages the algorithms group at Amazon AI. We build
> scalable
> > > machine learning systems and algorithms which are used both internally
> and
> > > externally by customers of SageMaker, AWS's flagship machine learning
> > > platform.
> > >
> > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale. Interests:
> > > Computational advertising, machine learning, speech recognition,
> > > data-driven analysis, large scale experimentation, big data,
> stream/complex
> > > event processing
> > >
> > > * Justin Thaler: (*) Assistant Professor, Department of Computer
> Science,
> > > Georgetown University, Washington D.C. Interests: algorithms and
> > > computational complexity, complexity theory, quantum algorithms,
> private
> > > data analysis, and learning theory, developing efficient streaming and
> > > sketching algorithms
> > >
> > > ==== Engineers That Love Science ====
> > >
> > > * Roman Leventov: Senior Software Engineer,  Metamarkets / Snap.
> Interests:
> > > design and implementation of data storing and data processing
> (distributed)
> > > systems, performance optimization, CPU performance, mechanical
> sympathy,
> > > JVM performance, API design, databases, (concurrent) data structures,
> > > memory management, garbage collection algorithms, language design and
> > > runtimes (their tradeoffs), distributed systems (cloud) efficiency,
> Linux,
> > > code quality, code transformation, pure functional programming models,
> > > Haskell.
> > >
> > > * Lee Rhodes: (*) Distinguished Architect, lead developer and founder
> of
> > > the DataSketches project, Yahoo, Sunnyvale, California.  Interests:
> > > streaming algorithms, mathematics, computer science, high quality and
> high
> > > performance code for the analysis of massive data, bridging the divide
> > > between theory and practice.
> > >
> > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo, Sunnyvale,
> > > California. Interests: applied mathematics, computer science, big data,
> > > distributed systems.
> > >
> > > === Introduction to Additional Interested Contributors ===
> > >
> > > These folks have been intermittently involved and contributed, but are
> > > strong supporters of this project.
> > >
> > > * Frank Grimes: GitHub ID: frankgrimes97
> > >
> > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D. Computer
> Science,
> > > Univ of Utah. Interests: Machine Learning, Data Mining, matrix
> > > approximation, streaming algorithms, randomized linear algebra.
> > >
> > > * Christopher Musco: [christopher.musco at gmail dot com] Ph.D.
> Computer
> > > Science, Research Instructor, Princeton University. Interests:
> algorithmic
> > > foundations of data science and machine learning, efficient methods for
> > > processing and understanding large datasets, often working at the
> > > intersection of theoretical computer science, numerical linear
> algebra, and
> > > optimization.
> > >
> > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D. Computer
> Science,
> > > Professor, Warwick University, Warwick, England. Interests: all
> aspects of
> > > the "data lifecycle", from data collection and cleaning, through
> mining and
> > > analytics. (Professor Cormode is one of the world’s leading scientists
> in
> > > sketching algorithms)
> > >
> > > === Alignment ===
> > >
> > > The DataSketches library already provides integrations and example
> code for
> > > Apache Hive, Apache Pig, Apache Spark and is deeply integrated into
> Apache
> > > Druid.
> > >
> > > == Known Risks ==
> > >
> > > The following subsections are specific risks that have been identified
> by
> > > the ASF that need to be addressed.
> > >
> > > === Risk: Orphaned Products ===
> > >
> > > The DataSketches library is presently used by a number of
> organizations,
> > > from small startups to Fortune 100 companies, to construct production
> > > pipelines that must process and analyze massive data. Yahoo has a
> long-term
> > > commitment to continue to advance the DataSketches library; moreover,
> > > DataSketches is seeing increasing interest, development, and adoption
> from
> > > many diverse organizations from around the world. Due to its growing
> > > adoption, we feel it is quite unlikely that this project would become
> > > orphaned.
> > >
> > > === Risk: Inexperience with Open Source ===
> > >
> > > Yahoo believes strongly in open source and the exchange of information
> to
> > > advance new ideas and work. Examples of this commitment are active open
> > > source projects such as those mentioned above. With DataSketches, we
> have
> > > been increasingly open and forward-looking; we have published a number
> of
> > > papers about breakthrough developments in the science of streaming
> > > algorithms (mentioned above) that also reference the DataSketches
> library.
> > > Our submission to the Apache Software Foundation is a logical
> extension of
> > > our commitment to open source software.
> > >
> > > Key committers at Yahoo with strong open source backgrounds include
> Aaron
> > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia Braginsky, Andrews
> > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan Call, Daryn
> > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne, Eshcar Hillel,
> > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco Perez-Sorrosal,
> > > Gil Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher, James
> Penick,
> > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon Eagles,
> Kihwal
> > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael Trelinski,
> > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L. Natkovich,
> > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy, Ruby Loo,
> Ryan
> > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu Kit Chan, Sri
> > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many more.
> > >
> > > All of our core developers are committed to learning about the Apache
> > > process and to giving back to the community.
> > >
> > > === Risk: Homogeneous Developers ===
> > >
> > > The majority of committers in this proposal belong to Yahoo because
> > > DataSketches emerged from an internal Yahoo project. This proposal also
> > > includes developers and contributors from other companies who are actively
> > > involved with other Apache projects, such as Druid.  We expect our entry
> > > into incubation will allow us to expand the number of individuals and
> > > organizations participating in DataSketches development.
> > >
> > > === Risk: Reliance on Salaried Developers ===
> > >
> > > Because the DataSketches library originated within Yahoo, it has been
> > > developed primarily by salaried Yahoo developers and we expect that to
> > > continue to be the case near term. However, since we placed this
> library
> > > into open-source we have had a number of significant contributions from
> > > engineers and scientists from outside of Yahoo. We expect our reliance
> on
> > > Yahoo salaried developers will decrease over time. Nonetheless, Yahoo
> is
> > > committed to continue its strong support of this important project.
> > >
> > > === Risk: Lack of Relationship to other Apache Products ===
> > >
> > > DataSketches already directly interoperates with or utilizes several
> > > existing Apache projects.
> > >
> > > * Build
> > >    * Apache Maven
> > >
> > > * Integrations and adaptors for the following projects naturally have
> them
> > > as dependencies
> > >    * Apache Hive
> > >    * Apache Pig
> > >    * Apache Druid
> > >    * Apache Spark
> > >
> > > * Additional dependencies for the above integrations and adaptors
> include
> > >    * Apache Hadoop
> > >    * Apache Commons (Math)
> > >
> > > There is no other Apache project that we are aware of that duplicates
> the
> > > functionality of the DataSketches library.
> > >
> > > === Risk: An Excessive Fascination with the Apache Brand ===
> > >
> > > With this proposal we are not seeking attention or publicity. Rather,
> we
> > > firmly believe in the DataSketches library and concept and the ability
> to
> > > make the DataSketches library a powerful, yet simple-to-use toolkit for
> > > data processing. While the DataSketches library has been open source,
> we
> > > believe putting code on GitHub can only go so far. We see the Apache
> > > community, processes, and mission as critical for ensuring the
> DataSketches
> > > library is truly community-driven, positively impactful, and innovative
> > > open source software. While Yahoo has taken a number of steps to
> advance
> > > its various open source projects, we believe the DataSketches library
> > > project is a great fit for the Apache Software Foundation due to its
> focus
> > > on data processing and its relationships to existing ASF projects.
> > >
> > > === Risk: Cryptography ===
> > >
> > > DataSketches does not contain any cryptographic code and is not a
> > > cryptographic product.
> > >
> > > == Documentation ==
> > >
> > > The following documentation is relevant to this proposal. Relevant
> portions
> > > of the documentation will be contributed to the Apache DataSketches
> > > project.
> > >
> > > * DataSketches website: https://datasketches.github.io.
> > >
> > > * DataSketches website repository:
> > > https://github.com/DataSketches/DataSketches.github.io
> > >
> > > We will need an Apache website for this documentation, similar to
> > >
> > > * https://datasketches.apache.org
> > >
> > > == Initial Source ==
> > >
> > > The initial source for DataSketches which we will submit to the Apache
> > > Foundation will include a number of repositories which are currently
> hosted
> > > under the GitHub.com/datasketches organization:
> > >
> > > All github.com/datasketches repositories including:
> > >
> > > * Java
> > >    * sketches-core: This repository has the core sketching classes,
> which
> > > are leveraged by some of the other repositories. This repository has no
> > > external dependencies outside of the DataSketches/memory repository,
> Java
> > > and TestNG for unit tests. This code is versioned and the latest
> release
> > > can be obtained from Maven Central.
> > >    * memory: Low level, high-performance memory data-structure
> management
> > > primarily for off-heap.
> > >    * sketches-android: This is a new repository dedicated to sketches
> > > designed to be run in a mobile client, such as a cell phone. It is
> still in
> > > development and should be considered experimental.
> > >    * sketches-hive: This repository contains Hive UDFs and UDAFs for
> use
> > > within Hadoop grid environments. This code has dependencies on
> > > sketches-core as well as Hadoop and Hive. Users of this code are
> advised to
> > > use Maven to bring in all the required dependencies. This code is
> versioned
> > > and the latest release can be obtained from Maven Central.
> > >    * sketches-pig: This repository contains Pig User Defined Functions
> > > (UDF) for use within Hadoop grid environments. This code has
> dependencies
> > > on sketches-core as well as Hadoop and Pig. Users of this code are
> advised
> > > to use Maven to bring in all the required dependencies. This code is
> > > versioned and the latest release can be obtained from Maven Central.
> > >    * sketches-vector: This is a new repository dedicated to sketches
> for
> > > vector and matrix operations. It is still somewhat experimental.
> > >    * characterization: This relatively new repository is for code that
> we
> > > use to characterize the accuracy and speed performance of the sketches
> in
> > > the library and is constantly being updated. Examples of the job
> command
> > > files used for various tests can be found in the src/main/resources
> > > directory. Some of these tests can run for hours depending on their
> > > configuration.
> > >    * experimental: This repository is an experimental staging area for
> code
> > > that will eventually end up in another repository. This code is not
> > > versioned and not registered with Maven Central.
> > >    * sketches-misc: Demos and other code not related to production
> > > deployment
> > >
> > > * C++ and Python
> > >    * sketches-core-cpp: This is the C++/Python companion to the Java
> > > sketches-core. These implementations are binary compatible with their
> > > counterparts in Java. In other words, a sketch created and stored in
> C++
> > > can be opened and read in Java and vice versa. This site also has our
> > > Python adaptors that basically wrap the C++ implementations, making the
> > > high performance C++ implementations available from Python.
> > >    * sketches-postgres: This site provides the postgres-specific
> adaptors
> > > that wrap the C++ implementations making them available to the Postgres
> > > database users.
> > >    * characterization-cpp: This is the C++/Python companion to the Java
> > > characterization repository.
> > >    * experimental-cpp: This repository is an experimental staging area
> for
> > > C++ code that will eventually end up in another repository.
> > >
> > > * Command-Line Tools
> > >    * sketches-cmd
> > >    * homebrew-sketches
> > >    * homebrew-sketches-cmd
> > >
> > > These projects have always been Apache 2.0 licensed. We intend to
> bundle
> > > all of these repositories since they are all complementary and should
> be
> > > maintained in one project. Prior to our submission, we will combine
> all of
> > > these projects into a new git repository.
> > >
> > > == Source and Intellectual Property Submission Plan ==
> > >
> > > Contributors to the DataSketches project have also signed the Yahoo
> > > Individual Contributor License Agreement
> > > (https://yahoocla.herokuapp.com/) in order to contribute to the project.
> > >
> > > With respect to trademark rights, Yahoo does not hold a trademark on
> the
> > > phrase “DataSketches.” Based on feedback and guidance we receive
> during the
> > > incubation process, we are open to renaming the project if necessary
> for
> > > trademark or other concerns, but we would prefer not to have to do
> that.
> > >
> > > == External Dependencies ==
> > >
> > > All external dependencies are licensed under an Apache 2.0 or
> > > Apache-compatible license. As we grow the DataSketches community we will
> > > configure our build process to require and validate that all
> > > contributions and dependencies are licensed under the Apache 2.0 license
> > > or an Apache-compatible license.
> > >
> > > == Required Resources ==
> > >
> > > === Mailing Lists ===
> > >
> > > We currently use a mix of mailing lists. We will migrate our existing
> > > mailing lists to the following:
> > >
> > > * d...@datasketches.incubator.apache.org
> > >
> > > * u...@datasketches.incubator.apache.org
> > >
> > > * priv...@datasketches.incubator.apache.org
> > >
> > > * comm...@datasketches.incubator.apache.org
> > >
> > > === Source Control ===
> > >
> > > The DataSketches team currently uses Git and would like to continue to
> do
> > > so. We request a Git repository for DataSketches with mirroring to GitHub
> > > enabled, similar to the following:
> > >
> > > * https://github.com/apache/incubator-datasketches.git
> > >
> > > === Issue Tracking ===
> > >
> > > We request the creation of an Apache-hosted JIRA. The DataSketches
> project
> > > is currently using the public GitHub issue tracker and the public
> Google
> > > Groups forum/sketches-user for issue tracking and discussions. We will
> > > migrate and combine the content of these two sources into the Apache JIRA.
> > >
> > > Proposed Jira ID: DATASKETCHES
> > >
> > > == Initial Committers ==
> > >
> > > The following individuals have been extremely active in our
> > > community and should have write (commit) permissions to the repository.
> > >
> > > * Eshcar Hillel                      [eshcar at verizonmedia dot com]
> > >
> > > * Kevin Lang                    [langk at verizonmedia dot com]
> > >
> > > * Roman Leventov              [roman.leventov at c.metamarkets dot com]
> > >
> > > * Edo Liberty                   [libertye at amazon dot com]
> > >
> > > * Jon Malkin                    [jmalkin at verizonmedia dot com]
> > >
> > > * Lee Rhodes                  [lrhodes at verizonmedia dot com] &
> > > [leerho at gmail dot com]
> > >
> > > * Alexander Saydakov         [saydakov at verizonmedia dot com]
> > >
> > > * Justin Thaler                 [justin.thaler at georgetown dot edu]
> > >
> > > == Affiliations ==
> > >
> > > The initial committers are from four organizations: Yahoo, Amazon,
> > > Georgetown University, and Metamarkets/Snap.
> > >
> > > === Champion ===
> > > (Recommended to me: )
> > >
> > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> apache
> > > dot org]
> > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > >
> > > === Nominated Mentors ===
> > > (Recommended to me: )
> > >
> > > Liang Chen, Vice President of Apache CarbonData, [chenliang613 at
> apache
> > > dot org]
> > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > Gil Yehuda, gyehuda at verizonmedia dot com
> > >
> > > === Sponsoring Entity ===
> > >
> > > * The Apache Incubator    **** This is our 1st choice ****
> > >
> > > * Apache Druid. The incubating Apache Druid project might also be a
> logical
> > > sponsor. However, DataSketches has applications in many areas of
> computing
> > > outside of Druid so our preference and recommendation is that
> DataSketches
> > > would ultimately be a top-level Apache project.
> > >
> > > ________________
> > > [1] In 2017 Verizon acquired Yahoo and merged it with previously
> acquired
> > > AOL. The merged entity was originally called Oath, Inc., but has
> recently
> > > been renamed Verizon Media, Inc., a wholly-owned subsidiary of Verizon,
> > > Inc.  Since Yahoo is the more recognized name, references in this document
> > > to Yahoo are also references to Verizon Media, Inc.
> > >
> > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <k...@apache.org>
> wrote:
> > >
> > > > The subject line has me interested already. Follow examples like this
> > > > maybe?
> > > >
> > > > 1.
> > > >
> > > >
> > >
> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > 2.
> > > >
> > > >
> > >
> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > >
> > > > Kenn
> > > >
> > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <lee...@gmail.com> wrote:
> > > >
> > > > > I'll try again ... :)
> > > > >
> > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <ted.dunn...@gmail.com
> >
> > > > wrote:
> > > > >
> > > > >> It didn't make it again
> > > > >>
> > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <lee...@gmail.com> wrote:
> > > > >>
> > > > >> > I'm not sure the attached document made it through.
> > > > >> >
> > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <lee...@gmail.com>
> wrote:
> > > > >> >
> > > > >> > >
> > > > >> > >
> > > > >> >
> > > > >>
> > > > >
> > > > >
> ---------------------------------------------------------------------
> > > > > To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> > > > > For additional commands, e-mail: general-h...@incubator.apache.org
> > > >
> > >
> >
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: general-unsubscr...@incubator.apache.org
> For additional commands, e-mail: general-h...@incubator.apache.org
>
>
