Re: DataSketches Proposal - Google Docs Link

leerho Mon, 25 Feb 2019 21:32:14 -0800

Yes, I thought of that.  But it’s not like I’m being overwhelmed with
requests to comment ... so far it has been only 3 or 4, and the requested
changes have been minor.  I’m assuming that if there are no more
substantive changes after this week that the document would be moved to the
wiki archive, where, I presume, changes could still be made.


I want to do the right thing here, so if you feel that the document would
get much better feedback on an unrestricted gDoc site, I will set it up.



On Mon, Feb 25, 2019 at 8:32 PM Jim Apple <jbap...@cloudera.com.invalid>
wrote:

> You could use a Google account that is not under Yahoo’s control, then let
> anyone in the world add a comment, maybe.
>
> On Mon, Feb 25, 2019 at 3:26 PM leerho <lee...@gmail.com> wrote:
>
> > Ken,
> > Yahoo does not allow me to create a shared link outside our company,
> except
> > to individual email addresses.  So attempting to share it to the email
> > general@incubator.apache.org may not work.  Nonetheless, several
> > individuals were able to request access using their individual email
> > accounts and I was able to add them.  I will try to add you using
> > k...@apache.org, but if that doesn't work, I may need a gmail or
> > equivalent
> > account for you.
> >
> > Lee.
> >
> >
> > On Mon, Feb 25, 2019 at 2:59 PM Kenneth Knowles <k...@apache.org> wrote:
> >
> > > I could not access that document. I suggest you need to turn on link
> > > sharing.
> > >
> > > Kenn
> > >
> > > On Mon, Feb 25, 2019 at 12:00 PM lee...@gmail.com <lee...@gmail.com>
> > > wrote:
> > >
> > > > Try this link:
> > > >
> > >
> >
> https://docs.google.com/document/d/19JKevzFQNcaLA51LFLUlP1hzdFDW7oDJrJO8N6weDv8/edit?usp=sharing
> > > >
> > > >
> > > > On 2019/02/25 05:55:50, leerho <lee...@gmail.com> wrote:
> > > > > Yes I will try that tomorrow.
> > > > >
> > > > > On Sun, Feb 24, 2019 at 7:34 PM Kenneth Knowles <k...@apache.org>
> > > wrote:
> > > > >
> > > > > > Can you share the Google doc with the proposal? Per Ted's advice,
> > we
> > > > can
> > > > > > iterate quickly there and move it to the wiki when it becomes a
> bit
> > > > more
> > > > > > stable.
> > > > > >
> > > > > > Kenn
> > > > > >
> > > > > > On Fri, Feb 22, 2019 at 10:21 PM lee...@gmail.com <
> > lee...@gmail.com>
> > > > > > wrote:
> > > > > >
> > > > > > > > Thanks for the offer.  I am a neophyte at this process and
> email
> > > > app!   I
> > > > > > > could use a lot of help getting this off the ground!  Also, I'm
> > not
> > > > sure
> > > > > > > that Mr. Chen and Mr. Onofré have fully accepted taking this on
> > :)
> > > > > > >
> > > > > > > Lee.
> > > > > > >
> > > > > > > On 2019/02/23 06:03:58, Kenneth Knowles <k...@apache.org>
> wrote:
> > > > > > > > Nice.
> > > > > > > >
> > > > > > > > I would very much like to help mentor this project, though
> you
> > > > already
> > > > > > > have
> > > > > > > > a couple good ones.
> > > > > > > >
> > > > > > > > I concur with incubator as sponsoring entity.
> > > > > > > >
> > > > > > > > Kenn (VP Apache Beam)
> > > > > > > >
> > > > > > > > On Fri, Feb 22, 2019 at 9:45 PM leerho <lee...@gmail.com>
> > wrote:
> > > > > > > >
> > > > > > > > > I didn't realize that this mail list does not accept PDF
> > files,
> > > > > > > apparently
> > > > > > > > > only text.  So let me try one more time ... :)  Please let
> me
> > > > know if
> > > > > > > > > this works!
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > = Apache DataSketches Proposal[1] =
> > > > > > > > >
> > > > > > > > > == Abstract ==
> > > > > > > > >
> > > > > > > > > DataSketches.GitHub.io is an open source, high-performance
> > > > library
> > > > > > of
> > > > > > > > > stochastic streaming algorithms commonly called "sketches"
> in
> > > the
> > > > > > data
> > > > > > > > > sciences. Sketches are small, stateful programs that
> process
> > > > massive
> > > > > > > data
> > > > > > > > > as a stream and can provide approximate answers, with
> > > > mathematical
> > > > > > > > > guarantees, to computationally difficult queries
> > > > orders-of-magnitude
> > > > > > > faster
> > > > > > > > > than traditional, exact methods.
> > > > > > > > >
> > > > > > > > > This proposal is to move DataSketches to the Apache
> Software
> > > > > > > > > > Foundation (ASF), transferring ownership of its copyright
> > > > intellectual
> > > > > > > > > property to the ASF.  Thereafter, DataSketches would be
> > > > officially
> > > > > > > known as
> > > > > > > > > Apache DataSketches and its evolution and governance would
> > come
> > > > under
> > > > > > > the
> > > > > > > > > rules and guidance of the ASF.
> > > > > > > > >
> > > > > > > > > == Introduction ==
> > > > > > > > >
> > > > > > > > > The DataSketches library contains carefully crafted
> > > > implementations
> > > > > > of
> > > > > > > > > sketch algorithms that meet rigorous standards of quality
> and
> > > > > > > performance
> > > > > > > > > and provide capabilities required for large-scale
> production
> > > > systems
> > > > > > > that
> > > > > > > > > must process and analyze massive data. The DataSketches
> core
> > > > > > > repository is
> > > > > > > > > written in Java with a parallel core repository written in
> > C++
> > > > that
> > > > > > > > > includes Python wrappers. The DataSketches library also
> > > includes
> > > > > > > special
> > > > > > > > > repositories for extending the core library for Apache Hive
> > and
> > > > > > Apache
> > > > > > > Pig.
> > > > > > > > > The sketches developed in the different languages share a
> > > common
> > > > > > binary
> > > > > > > > > storage format so that sketches created and stored in Java,
> > for
> > > > > > > example,
> > > > > > > > > > can be fully used in C++, and vice versa.  Because the
> stored
> > > > sketch
> > > > > > > > > "images" are just a "blob" of bytes (similar to picture
> > > images),
> > > > they
> > > > > > > can
> > > > > > > > > be shared across many different systems, languages and
> > > platforms.
> > > > > > > > >
> > > > > > > > > The DataSketches documentation website,
> > > > > > https://datasketches.github.io
> > > > > > > ,
> > > > > > > > > includes general tutorials, a comprehensive research
> section
> > > with
> > > > > > > > > references to relevant academic papers, extensive examples
> > for
> > > > using
> > > > > > > the
> > > > > > > > > core library directly as well as examples for accessing the
> > > > library
> > > > > > in
> > > > > > > > > Hive, Pig, and Apache Spark.
> > > > > > > > >
> > > > > > > > > The DataSketches library also includes a characterization
> > > > repository
> > > > > > > for
> > > > > > > > > long running test programs that are used for studying
> > accuracy
> > > > and
> > > > > > > > > performance of these sketches over wide ranges of input
> > > > variables.
> > > > > > The
> > > > > > > data
> > > > > > > > > produced by these programs is used for generating the many
> > > > > > performance
> > > > > > > > > plots contained in the documentation website and for
> academic
> > > > > > > > > publications.
> > > > > > > > >
> > > > > > > > > The code repositories used for production are versioned and
> > > > published
> > > > > > > to
> > > > > > > > > Maven Central on periodic intervals as the library evolves.
> > > > > > > > >
> > > > > > > > > The DataSketches library also includes several experimental
> > > > > > > repositories
> > > > > > > > > for use-cases outside the large-scale systems environments,
> > > such
> > > > as
> > > > > > > > > sketches for mobile, IoT devices (Android), command-line
> > access
> > > > of
> > > > > > the
> > > > > > > > > sketch library, and an experimental repository for
> > vector-based
> > > > > > > sketches
> > > > > > > > > that performs approximate Singular Value Decomposition
> (SVD)
> > > > analysis
> > > > > > > that
> > > > > > > > > could potentially be used in Machine Learning (ML)
> > > applications.
> > > > > > > > >
> > > > > > > > > == Background ==
> > > > > > > > >
> > > > > > > > > > The DataSketches library was started in 2012 as an internal
> > Yahoo
> > > > > > project
> > > > > > > to
> > > > > > > > > dramatically reduce time and resources required for
> distinct
> > > > (unique)
> > > > > > > > > counting.  An extensive search on the Internet at the time
> > > > yielded a
> > > > > > > number
> > > > > > > > > of theoretical papers on stochastic streaming algorithms
> with
> > > > > > > pseudocode
> > > > > > > > > examples, but we did not find any usable open-source code
> of
> > > the
> > > > > > > quality we
> > > > > > > > > felt we needed for our internal production systems.  So we
> > > > started a
> > > > > > > small
> > > > > > > > > project (one person) to develop our own sketches working
> > > directly
> > > > > > from
> > > > > > > > > published theoretical papers.
> > > > > > > > >
> > > > > > > > > The DataSketches library was designed from the start with
> the
> > > > > > > objective of
> > > > > > > > > making these algorithms, usually only described in
> > theoretical
> > > > > > papers,
> > > > > > > > > easily accessible to systems developers for use in our
> > internal
> > > > > > > production
> > > > > > > > > systems. By necessity, the code had to be of the highest
> > > quality
> > > > and
> > > > > > > > > thoroughly tested. The wide variety of our internal
> > production
> > > > > > systems
> > > > > > > > > drove the requirement that the sketch implementations had
> to
> > > > have an
> > > > > > > > > absolute minimum of external, run-time dependencies in
> order
> > to
> > > > > > > simplify
> > > > > > > > > integration and troubleshooting.
> > > > > > > > >
> > > > > > > > > Our internal experiments demonstrated dramatic positive
> > impact
> > > > on the
> > > > > > > > > performance of our systems.  As a result, the DataSketches
> > > > library
> > > > > > > quickly
> > > > > > > > > evolved to include different types of sketches for
> different
> > > > types of
> > > > > > > > > > queries, such as frequent-items (a.k.a. heavy-hitters)
> > > > algorithms,
> > > > > > > > > quantile/histogram algorithms, and weighted and unweighted
> > > > sampling
> > > > > > > > > algorithms.
> > > > > > > > >
> > > > > > > > > We quickly discovered that developing these sketch
> algorithms
> > > to
> > > > be
> > > > > > > truly
> > > > > > > > > robust in production environments is quite difficult and
> > > requires
> > > > > > deep
> > > > > > > > > understanding of the underlying mathematics and statistics
> as
> > > > well as
> > > > > > > > > extensive experience in developing high quality code for
> 24/7
> > > > > > > production
> > > > > > > > > systems. This is a difficult combination of skills for any
> > one
> > > > > > > organization
> > > > > > > > > to collect and maintain over time. It became clear that
> this
> > > > > > technology
> > > > > > > > > needed a community larger than Yahoo to evolve.  In
> November,
> > > > 2015,
> > > > > > > this
> > > > > > > > > factor, along with Yahoo’s strong experience and support of
> > > open
> > > > > > > source,
> > > > > > > > > led to the decision to open source this technology under an
> > > > Apache
> > > > > > 2.0
> > > > > > > > > license on GitHub. Since that time our community has
> expanded
> > > > > > > considerably
> > > > > > > > > > and the key contributors to this effort include leading
> > > research
> > > > > > > > > scientists from a number of universities as well as
> > > > practitioners and
> > > > > > > > > researchers from a number of major corporations. The core
> of
> > > this
> > > > > > > group is
> > > > > > > > > very active as we meet weekly to discuss research
> directions
> > > and
> > > > > > > > > engineering priorities.
> > > > > > > > >
> > > > > > > > > It is important to note that our internal systems at Yahoo
> > use
> > > > the
> > > > > > > current
> > > > > > > > > public GitHub open source DataSketches library and not an
> > > > internal
> > > > > > > version
> > > > > > > > > of the code.
> > > > > > > > >
> > > > > > > > > The close collaboration of scientific research and
> > engineering
> > > > > > > development
> > > > > > > > > experience with actual massive-data processing systems has
> > also
> > > > > > > produced
> > > > > > > > > new research publications in the field of stochastic
> > streaming
> > > > > > > algorithms,
> > > > > > > > > for example:
> > > > > > > > >
> > > > > > > > > * Daniel Anderson, Pryce Bevan, Kevin J. Lang, Edo Liberty,
> > Lee
> > > > > > > Rhodes, and
> > > > > > > > > Justin Thaler. A high-performance algorithm for identifying
> > > > frequent
> > > > > > > items
> > > > > > > > > in data streams. In ACM IMC 2017.
> > > > > > > > >
> > > > > > > > > * Anirban Dasgupta, Kevin J. Lang, Lee Rhodes, and Justin
> > > > Thaler. A
> > > > > > > > > framework for estimating stream expression cardinalities.
> In
> > > > > > > EDBT/ICDT
> > > > > > > > > > Proceedings ‘16, pages 6:1–6:17, 2016.
> > > > > > > > >
> > > > > > > > > * Mina Ghashami, Edo Liberty, Jeff M. Phillips. Efficient
> > > > Frequent
> > > > > > > > > Directions Algorithm for Sparse Matrices. In ACM SIGKDD
> > > > Proceedings
> > > > > > > ‘16,
> > > > > > > > > pages 845-854, 2016.
> > > > > > > > >
> > > > > > > > > * Zohar S. Karnin, Kevin J. Lang, and Edo Liberty. Optimal
> > > > quantile
> > > > > > > > > approximation in streams. In IEEE FOCS Proceedings ‘16,
> pages
> > > > 71–78,
> > > > > > > 2016.
> > > > > > > > >
> > > > > > > > > * Kevin J Lang. Back to the future: an even more nearly
> > optimal
> > > > > > > cardinality
> > > > > > > > > estimation algorithm. arXiv preprint
> > > > > > https://arxiv.org/abs/1708.06839,
> > > > > > > > > 2017.
> > > > > > > > >
> > > > > > > > > * Edo Liberty. Simple and deterministic matrix sketching.
> In
> > > ACM
> > > > KDD
> > > > > > > > > > Proceedings ‘13, pages 581–588, 2013.
> > > > > > > > >
> > > > > > > > > * Edo Liberty, Michael Mitzenmacher, Justin Thaler, and
> > > Jonathan
> > > > > > > Ullman.
> > > > > > > > > Space lower bounds for itemset frequency sketches. In ACM
> > PODS
> > > > > > > Proceedings
> > > > > > > > > ‘16, pages 441–454, 2016.
> > > > > > > > >
> > > > > > > > > * Michael Mitzenmacher, Thomas Steinke, and Justin Thaler.
> > > > > > Hierarchical
> > > > > > > > > heavy hitters with the space saving algorithm. In SIAM
> ALENEX
> > > > > > > Proceedings
> > > > > > > > > ‘12, pages 160–174, 2012.
> > > > > > > > >
> > > > > > > > > == The Rationale for Sketches ==
> > > > > > > > >
> > > > > > > > > In the analysis of big data there are often problem queries
> > > that
> > > > > > don’t
> > > > > > > > > scale because they require huge compute resources and time
> to
> > > > > > generate
> > > > > > > > > exact results. Examples include count distinct, quantiles,
> > most
> > > > > > > frequent
> > > > > > > > > items, joins, matrix computations, and graph analysis.
> > > > > > > > >
> > > > > > > > > If we can loosen the requirement of “exact” results from
> our
> > > > queries
> > > > > > > and be
> > > > > > > > > satisfied with approximate results, within some well
> > understood
> > > > > > bounds
> > > > > > > of
> > > > > > > > > error, there is an entire branch of mathematics and data
> > > science
> > > > that
> > > > > > > has
> > > > > > > > > evolved around developing algorithms that can produce
> > > approximate
> > > > > > > results
> > > > > > > > > with mathematically well-defined error properties.
> > > > > > > > >
> > > > > > > > > > Adding the requirements that these algorithms must
> > be
> > > > small
> > > > > > > > > (compared to the size of the input data), sublinear (the
> size
> > > of
> > > > the
> > > > > > > sketch
> > > > > > > > > must grow at a slower rate than the size of the input
> > stream),
> > > > > > > streaming
> > > > > > > > > (they can only touch each data item once), and mergeable
> > > > (suitable
> > > > > > for
> > > > > > > > > distributed processing), defines a class of algorithms that
> > can
> > > > be
> > > > > > > > > described as small, stochastic, streaming, sublinear
> > mergeable
> > > > > > > algorithms,
> > > > > > > > > commonly called sketches (they also have other names, but
> we
> > > > will use
> > > > > > > the
> > > > > > > > > term sketches from here on).
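
To make the four properties above concrete, here is a toy K-Minimum-Values
(KMV) distinct-count estimator in Python. It is an illustration only and is
not the DataSketches implementation: the class name KmvSketch, the parameter
k, and the use of SHA-256 as the hash are all assumptions made for this
example. It retains at most k hash values however long the stream is (small,
sublinear), looks at each item exactly once (streaming), and two sketches
built with the same k can be combined (mergeable).

```python
import hashlib
import heapq


class KmvSketch:
    """Toy K-Minimum-Values distinct-count sketch (hypothetical, illustrative)."""

    def __init__(self, k=256):
        self.k = k
        self._heap = []        # negated hashes, so _heap[0] is the largest kept
        self._members = set()  # the hash values currently retained

    @staticmethod
    def _hash(item):
        # Map an item to a pseudo-uniform float in the unit interval.
        digest = hashlib.sha256(repr(item).encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    def update(self, item):
        """Process one stream item; each item is touched only once."""
        self._offer(self._hash(item))

    def _offer(self, h):
        if h in self._members:
            return  # duplicate of a retained value: state is unchanged
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, -h)
            self._members.add(h)
        elif h < -self._heap[0]:
            # Replace the largest retained hash with the new, smaller one.
            evicted = -heapq.heappushpop(self._heap, -h)
            self._members.discard(evicted)
            self._members.add(h)

    def merge(self, other):
        """Fold another sketch (assumed to use the same k) into this one."""
        for h in other._members:
            self._offer(h)

    def estimate(self):
        if len(self._heap) < self.k:
            return float(len(self._heap))  # fewer than k distinct hashes: exact
        kth_smallest = -self._heap[0]
        return (self.k - 1) / kth_smallest
```

Merging keeps the k smallest hashes of the union, which is exactly the state
a single sketch would reach if it had seen both streams. That is why this
kind of structure suits distributed processing.
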
> > > > > > > > >
> > > > > > > > > To be truly streaming and be able to process data in a
> single
> > > > pass,
> > > > > > > > > sketches must make absolute minimum assumptions about the
> > input
> > > > > > stream.
> > > > > > > > > This is critically important, as there is no “second
> chance”
> > to
> > > > > > > process the
> > > > > > > > > data.
> > > > > > > > >
> > > > > > > > > For example, sketches should not make assumptions about the
> > > > order of
> > > > > > > stream
> > > > > > > > > items, the stream length, the dynamic range of values, or
> the
> > > > > > > distribution
> > > > > > > > > of item occurrence frequencies. Sketches should be tolerant
> > of
> > > > NaNs,
> > > > > > > Nulls
> > > > > > > > > and empty objects. About the only thing that the sketch
> needs
> > > to
> > > > know
> > > > > > > about
> > > > > > > > > the stream is how to extract items from it and what type
> the
> > > > item is,
> > > > > > > e.g.,
> > > > > > > > > is it a numeric value or a string.
> > > > > > > > >
> > > > > > > > > As far as the sketch is concerned, the input stream is a
> > > > sequence of
> > > > > > > items
> > > > > > > > > in some unknown random order with unknown random values.
> > > > > > > > >
> > > > > > > > > The sketch is essentially a complex state machine and
> > combined
> > > > with
> > > > > > the
> > > > > > > > > random input stream defines a stochastic process. We then
> > apply
> > > > > > > > > probabilistic methods to interpret the states of the
> > stochastic
> > > > > > > process in
> > > > > > > > > order to extract useful information about the input stream
> > > > itself.
> > > > > > The
> > > > > > > > > resulting information will be approximate, but we also use
> > > > additional
> > > > > > > > > probabilistic methods to extract an estimate of the likely
> > > > > > probability
> > > > > > > > > distribution of error.
> > > > > > > > >
> > > > > > > > > There is a significant scientific contribution here that is
> > > > defining
> > > > > > > the
> > > > > > > > > state machine, understanding the resulting stochastic
> > process,
> > > > > > > developing
> > > > > > > > > > the probabilistic methods, and proving mathematically that
> > it
> > > > all
> > > > > > > works!
> > > > > > > > > This is why the scientific contributors to this project
> are a
> > > > > > critical
> > > > > > > and
> > > > > > > > > strategic component to our success.  The development
> > engineers
> > > > > > > translate
> > > > > > > > > the concepts of the proposed state machine and
> probabilistic
> > > > methods
> > > > > > > into
> > > > > > > > > production-quality code. Even more important, they work
> > closely
> > > > with
> > > > > > > the
> > > > > > > > > scientists, feeding back system and user requirements,
> which
> > > > leads
> > > > > > not
> > > > > > > only
> > > > > > > > > to superior product design, but to new science as well.  A
> > > > number of
> > > > > > > > > scientific papers our members have published (see above)
> are a
> > > > direct
> > > > > > > result
> > > > > > > > > of this close collaboration.
> > > > > > > > >
> > > > > > > > > Because sketches are small they can be processed extremely
> > > fast,
> > > > > > often
> > > > > > > many
> > > > > > > > > orders-of-magnitude faster than traditional exact
> > computations.
> > > > For
> > > > > > > > > interactive queries there may not be other viable
> > alternatives,
> > > > and
> > > > > > in
> > > > > > > the
> > > > > > > > > case of real-time analysis, sketches are the only known
> > > solution.
> > > > > > > > >
> > > > > > > > > For any system that needs to extract useful information
> from
> > > > massive
> > > > > > > > data,
> > > > > > > > > sketches are essential tools that should be tightly
> > integrated
> > > > into
> > > > > > the
> > > > > > > > > system’s analysis capabilities. This technology has helped
> > > Yahoo
> > > > > > > > > successfully reduce data processing times from days to
> hours
> > or
> > > > > > > minutes on
> > > > > > > > > a number of its internal platforms and has enabled
> subsecond
> > > > queries
> > > > > > on
> > > > > > > > > real-time platforms that would have been infeasible without
> > > > sketches.
> > > > > > > > > >
> > > > > > > > > > == The Rationale for Apache DataSketches ==
> > > > > > > > > >
> > > > > > > > > Other open source implementations of sketch algorithms can
> be
> > > > found
> > > > > > on
> > > > > > > the
> > > > > > > > > Internet. However, we have not yet found any open source
> > > > > > > implementations
> > > > > > > > > that are as comprehensive, engineered with the quality
> > required
> > > > for
> > > > > > > > > production systems, and with usable and guaranteed error
> > > > properties.
> > > > > > > Large
> > > > > > > > > Internet companies, such as Google and Facebook, have
> > published
> > > > > > papers
> > > > > > > on
> > > > > > > > > > sketching; however, their implementations of their
> published
> > > > > > > algorithms are
> > > > > > > > > proprietary and not available as open source.
> > > > > > > > >
> > > > > > > > > The DataSketches library already provides integrations
> with a
> > > > number
> > > > > > of
> > > > > > > > > major Apache data processing platforms such as Apache Hive,
> > > > Apache
> > > > > > Pig,
> > > > > > > > > Apache Spark and Apache Druid, and is also integrated with
> a
> > > > number
> > > > > > of
> > > > > > > > > other open source data processing platforms such as Splice
> > > > Machine,
> > > > > > > GCHQ
> > > > > > > > > Gaffer and PostgreSQL.
> > > > > > > > >
> > > > > > > > > We believe that having DataSketches as an Apache project
> will
> > > > provide
> > > > > > > an
> > > > > > > > > immediate, worthwhile, and substantial contribution to the
> > open
> > > > > > source
> > > > > > > > > community, will have a better opportunity to provide a
> > > meaningful
> > > > > > > > > contribution to both the science and engineering of
> sketching
> > > > > > > algorithms,
> > > > > > > > > and integrate with other Apache projects.  In addition,
> this
> > > is a
> > > > > > > > > significant opportunity for Apache to be the "go-to"
> > > destination
> > > > for
> > > > > > > users
> > > > > > > > > that want to leverage this exciting technology.
> > > > > > > > >
> > > > > > > > > == Initial Goals ==
> > > > > > > > >
> > > > > > > > > We are breaking our initial goals into short-term (2-6
> > months)
> > > > and
> > > > > > > > > > intermediate to long-term (6 months to 2 years):
> > > > > > > > >
> > > > > > > > > Our short-term goals include:
> > > > > > > > >
> > > > > > > > > * Understanding and adapting to the Apache development
> > process
> > > > and
> > > > > > > > > structures.
> > > > > > > > >
> > > > > > > > > * Start refactoring codebase and move various DataSketches
> > > > > > repositories
> > > > > > > > > code to Apache Git repository.
> > > > > > > > >
> > > > > > > > > * Continue development of new features, functions, and
> fixes.
> > > > > > > > >
> > > > > > > > > * Specific sub-projects (e.g., C++ and Python) will
> continue
> > to
> > > > be
> > > > > > > > > developed and expanded.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > The intermediate to long term goals include:
> > > > > > > > >
> > > > > > > > > * Completing the design and implementation of the C++
> > sketches
> > > to
> > > > > > > > > complement what is already available in Java, and the
> Python
> > > > wrappers
> > > > > > > of
> > > > > > > > > those C++ sketches.
> > > > > > > > >
> > > > > > > > > * Expanding the C++ build framework to include Windows and
> > the
> > > > > > popular
> > > > > > > > > Linux variants.
> > > > > > > > >
> > > > > > > > > * Continued engagement with the scientific research
> community
> > > on
> > > > the
> > > > > > > > > development of new algorithms for computationally difficult
> > > > problems
> > > > > > > that
> > > > > > > > > heretofore have not had a sketching solution.
> > > > > > > > >
> > > > > > > > > == Current Status ==
> > > > > > > > >
> > > > > > > > > The DataSketches GitHub project has been quite successful.
> > As
> > > of
> > > > > > this
> > > > > > > > > writing (Feb, 2019) the number of downloads measured by the
> > > Nexus
> > > > > > > > > Repository Manager at https://oss.sonatype.org has grown
> by
> > > > nearly a
> > > > > > > > > factor
> > > > > > > > > of 10 over the past year to about 55 thousand per month.
> The
> > > > > > > > > DataSketches/sketches-core repository has about 560 stars
> and
> > > 141
> > > > > > > forks,
> > > > > > > > > which is pretty good for a highly specialized library.
> > > > > > > > >
> > > > > > > > > === Development Practices ===
> > > > > > > > >
> > > > > > > > > ==== Source Control ====
> > > > > > > > >
> > > > > > > > > All of our developers have extensive experience with Git
> > > version
> > > > > > > control
> > > > > > > > > and follow accepted practices for use of Pull Requests
> (PRs),
> > > > code
> > > > > > > reviews
> > > > > > > > > and commits to master, for example.
> > > > > > > > >
> > > > > > > > > ==== Testing ====
> > > > > > > > >
> > > > > > > > > > Sketches, by their nature, are probabilistic programs and
> > don’t
> > > > > > > necessarily
> > > > > > > > > behave deterministically.  For some of the sketches we
> > > > intentionally
> > > > > > > insert
> > > > > > > > > random noise into the code as this gives us the
> mathematical
> > > > > > properties
> > > > > > > > > that we need to guarantee accuracy.  This can make the
> > behavior
> > > > of
> > > > > > > these
> > > > > > > > > algorithms quite unintuitive and provides significant
> > > challenges
> > > > to
> > > > > > the
> > > > > > > > > developer who wishes to test these algorithms for
> > correctness.
> > > > As a
> > > > > > > result,
> > > > > > > > > our testing strategy includes two major components: unit
> > tests,
> > > > and
> > > > > > > > > characterization tests.
> > > > > > > > >
> > > > > > > > > ===== Unit Testing =====
> > > > > > > > >
> > > > > > > > > Our unit tests are primarily quick tests to make sure that
> we
> > > > > > exercise
> > > > > > > all
> > > > > > > > > critical paths in the code and that key branches are
> executed
> > > > > > > correctly. It
> > > > > > > > > is important that they execute relatively fast as they are
> > > > generally
> > > > > > > run on
> > > > > > > > > every code build. The sketches-core repository alone has
> > about
> > > 22
> > > > > > > thousand
> > > > > > > > > statements, over 1300 unit tests and code coverage of about
> > > > 98.2% as
> > > > > > > > > measured by Atlassian/Clover.  It is our goal for all of
> our
> > > code
> > > > > > > > > repositories that are used in production that they have
> code
> > > > coverage
> > > > > > > > > greater than 90%.
> > > > > > > > >
> > > > > > > > > ===== Characterization Testing =====
> > > > > > > > >
> > > > > > > > > In order to test the probabilistic methods that are used to
> > > > interpret
> > > > > > > the
> > > > > > > > > stochastic behaviors of our sketches we have a separate
> > > > > > > characterization
> > > > > > > > > repository that is dedicated to this.  To measure accuracy,
> > for
> > > > > > > example,
> > > > > > > > > requires running thousands of trials at each of many
> > different
> > > > points
> > > > > > > along
> > > > > > > > > the domain axis. Each trial compares its estimated results
> > > > against a
> > > > > > > known
> > > > > > > > > exact result producing an error for that trial.  These
> error
> > > > > > > measurements
> > > > > > > > > are then fed into our Quantiles sketch to capture the
> actual
> > > > > > > distribution
> > > > > > > > > of error at that point along the axis. We then select
> > quantile
> > > > > > contours
> > > > > > > > > across all the distributions at points along the axis.
> These
> > > > > > contours
> > > > > > > can
> > > > > > > > > then be plotted to reveal the shape of the actual error
> > > > distribution.
> > > > > > > These
> > > > > > > > > > distributions are not at all Gaussian; in fact, they can be
> > > quite
> > > > > > > complex.
> > > > > > > > > Nonetheless, these distributions are then checked against
> our
> > > > > > > statistical
> > > > > > > > > guarantees inherent to the specific sketch algorithm and
> its
> > > > > > > parameters.
> > > > > > > > > There are many examples of these characterization error
> > > > distributions
> > > > > > > on
> > > > > > > > > our website. The runtimes of these tests can be very long
> and
> > > can
> > > > > > range
> > > > > > > > > from many minutes to hours, and some can run for days.
> > > > Currently, we
> > > > > > > have
> > > > > > > > > separate characterization repositories for Java and C++ /
> > > Python.
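
The procedure described above can be outlined in a few lines of Python. This
is a simplified sketch under stated assumptions, not code from the
characterization repositories: the function names kmv_estimate and
characterize, the toy KMV estimator being measured, and every parameter value
are hypothetical. It also sorts the per-trial errors exactly, where the text
above says the real tests feed them into a Quantiles sketch.

```python
import hashlib
import heapq


def kmv_estimate(n_distinct, k, trial):
    """Estimate the distinct count of a synthetic stream of n_distinct items."""
    def h(i):
        # Salting with the trial number makes each trial an independent draw.
        digest = hashlib.sha256(f"{trial}:{i}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") / 2.0**64

    smallest = heapq.nsmallest(k, (h(i) for i in range(n_distinct)))
    if len(smallest) < k:
        return float(len(smallest))  # fewer than k items: the count is exact
    return (k - 1) / smallest[-1]


def characterize(points, k=64, trials=200, ranks=(0.05, 0.5, 0.95)):
    """Return {n: [relative error at each rank]} for every n in points."""
    contours = {}
    for n in points:
        # One relative error per trial, measured against the known exact count n.
        errors = sorted(kmv_estimate(n, k, t) / n - 1.0 for t in range(trials))
        # Read the requested quantile contours off the sorted errors.
        contours[n] = [errors[min(int(r * trials), trials - 1)] for r in ranks]
    return contours


if __name__ == "__main__":
    for n, row in characterize([32, 256, 2048]).items():
        print(n, ["%+.3f" % e for e in row])
```

Plotting each rank's values against n gives one contour per quantile, which
is the kind of plot the paragraph above describes.
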
> > > > > > > > >
> > > > > > > > > It is our goal that we perform this characterization
> analysis
> > > > for all
> > > > > > > of
> > > > > > > > > our sketches.  By definition, the code that runs these
> > > > > > characterization
> > > > > > > > > tests is open-source so others can run these tests as well.
> > We
> > > > do
> > > > > > not
> > > > > > > have
> > > > > > > > > formal releases of this code (because it is not production
> > > code)
> > > > and
> > > > > > > it is
> > > > > > > > > not published to Maven Central.
> > > > > > > > >
> > > > > > > > > === Meritocracy ===
> > > > > > > > >
> > > > > > > > > DataSketches was initially developed based on requirements
> > > within
> > > > > > > Yahoo. As
> > > > > > > > > a project on GitHub, DataSketches has received
> contributions
> > > from
> > > > > > > numerous
> > > > > > > > > individual developers from around the world, dedicated
> > research
> > > > work
> > > > > > > from
> > > > > > > > > senior scientists at Amazon and Visa, and academic
> > researchers
> > > > from
> > > > > > > > > Georgetown University, Princeton, and MIT.
> > > > > > > > >
> > > > > > > > > As a project under incubation, we are committed to
> expanding
> > > our
> > > > > > > effort to
> > > > > > > > > build an environment which supports a meritocracy. We are
> > > > focused on
> > > > > > > > > engaging the community and other related projects for
> support
> > > and
> > > > > > > > > > contributions. Moreover, we are committed to ensuring that
> > > contributors
> > > > and
> > > > > > > > > committers to DataSketches come from a broad mix of
> > > organizations
> > > > > > > through a
> > > > > > > > > merit-based decision process during incubation. We believe
> > > > strongly
> > > > > > in
> > > > > > > the
> > > > > > > > > DataSketches premise that fulfills the concept of a well
> > > > engineered
> > > > > > and
> > > > > > > > > scientifically rigorous library that implements these
> > powerful
> > > > > > > algorithms
> > > > > > > > > and are committed to growing an inclusive community of
> > > > DataSketches
> > > > > > > > > contributors and users.
> > > > > > > > >
> > > > > > > > > === Community ===
> > > > > > > > >
> > > > > > > > > Yahoo has a long history and active engagement in the Open
> > > Source
> > > > > > > > > community. Major projects include: Vespa.ai, Bullet,
> Moloch,
> > > > > > Panoptes,
> > > > > > > > > Screwdriver.cd, Athenz, HaloDB, Maha, Mendel,
> > > TensorFlowOnSpark,
> > > > > > > gifshot,
> > > > > > > > > fluxible, as well as the creation, contribution and
> > incubation
> > > of
> > > > > > many
> > > > > > > > > Apache projects such as Apache Hadoop, Pig, Bookkeeper,
> > Oozie,
> > > > > > > Zookeeper,
> > > > > > > > > Omid, Pulsar, Traffic Server, Storm, Druid, and many more.
> > > > > > > > >
> > > > > > > > > > Every day, DataSketches is actively used by organizations
> > and
> > > > > > > > > institutions around the world for batch and stream
> processing
> > > of
> > > > > > data.
> > > > > > > We
> > > > > > > > > believe acceptance will allow us to consolidate existing
> > > > > > > > > DataSketches-related work, grow the DataSketches community,
> > and
> > > > > > deepen
> > > > > > > > > connections between DataSketches and other open source
> > > projects.
> > > > > > > > >
> > > > > > > > > === Introduction to the Core Developers & Contributors ===
> > > > > > > > >
> > > > > > > > > The core developers and contributors for DataSketches are
> > from
> > > > > > diverse
> > > > > > > > > backgrounds, but primarily are scientists that love
> > engineering
> > > > and
> > > > > > > > > engineers that love science. A large part of the value we
> > bring
> > > > comes
> > > > > > > from
> > > > > > > > > this synthesis.  These individuals have already contributed
> > > > > > > substantially
> > > > > > > > > to the code, algorithms, and/or mathematical proofs that
> form
> > > the
> > > > > > > basis of
> > > > > > > > > the library.
> > > > > > > > >
> > > > > > > > > > This core group also forms the Initial Committers with write
> > > > > > > permissions to
> > > > > > > > > > the repository. Those marked with (*) meet weekly to plan
> the
> > > > > > research
> > > > > > > and
> > > > > > > > > engineering direction of the project.
> > > > > > > > >
> > > > > > > > > ==== Scientists That Love Engineering ====
> > > > > > > > >
> > > > > > > > > * Eshcar Hillel: Senior Research Scientist, Yahoo Labs,
> > Israel.
> > > > > > > Interests:
> > > > > > > > > distributed systems, scalable systems and platforms for big
> > > data
> > > > > > > > > > processing, concurrent algorithms and data structures.
> > > > > > > > >
> > > > > > > > > * Kevin Lang: (*) Distinguished Research Scientist, Yahoo
> > Labs,
> > > > > > > Sunnyvale,
> > > > > > > > > California. Interests: algorithms, theoretical and applied
> > > > > > mathematics,
> > > > > > > > > encoding and compression theory, theoretical and applied
> > > > performance
> > > > > > > > > optimization.
> > > > > > > > >
> > > > > > > > > * Edo Liberty: (*) Director of Research, Head of Amazon AI
> > > Labs,
> > > > Palo
> > > > > > > Alto,
> > > > > > > > > California. Manages the algorithms group at Amazon AI. We
> > build
> > > > > > > scalable
> > > > > > > > > machine learning systems and algorithms which are used both
> > > > > > internally
> > > > > > > and
> > > > > > > > > externally by customers of SageMaker, AWS's flagship
> machine
> > > > learning
> > > > > > > > > platform.
> > > > > > > > >
> > > > > > > > > * Jon Malkin: (*) Senior Scientist, Yahoo Labs, Sunnyvale.
> > > > Interests:
> > > > > > > > > Computational advertising, machine learning, speech
> > > recognition,
> > > > > > > > > data-driven analysis, large scale experimentation, big
> data,
> > > > > > > stream/complex
> > > > > > > > > event processing
> > > > > > > > >
> > > > > > > > > * Justin Thaler: (*) Assistant Professor, Department of
> > > Computer
> > > > > > > Science,
> > > > > > > > > Georgetown University, Washington D.C. Interests:
> algorithms
> > > and
> > > > > > > > > computational complexity, complexity theory, quantum
> > > algorithms,
> > > > > > > private
> > > > > > > > > data analysis, and learning theory, developing efficient
> > > > streaming
> > > > > > and
> > > > > > > > > sketching algorithms
> > > > > > > > >
> > > > > > > > > ==== Engineers That Love Science ====
> > > > > > > > >
> > > > > > > > > * Roman Leventov: Senior Software Engineer,  Metamarkets /
> > > Snap.
> > > > > > > Interests:
> > > > > > > > > design and implementation of data storing and data
> processing
> > > > > > > (distributed)
> > > > > > > > > systems, performance optimization, CPU performance,
> > mechanical
> > > > > > > sympathy,
> > > > > > > > > JVM performance, API design, databases, (concurrent) data
> > > > structures,
> > > > > > > > > memory management, garbage collection algorithms, language
> > > > design and
> > > > > > > > > runtimes (their tradeoffs), distributed systems (cloud)
> > > > efficiency,
> > > > > > > Linux,
> > > > > > > > > code quality, code transformation, pure functional
> > programming
> > > > > > models,
> > > > > > > > > Haskell.
> > > > > > > > >
> > > > > > > > > * Lee Rhodes: (*) Distinguished Architect, lead developer
> and
> > > > founder
> > > > > > > of
> > > > > > > > > the DataSketches project, Yahoo, Sunnyvale, California.
> > > > Interests:
> > > > > > > > > streaming algorithms, mathematics, computer science, high
> > > > quality and
> > > > > > > high
> > > > > > > > > performance code for the analysis of massive data, bridging
> > the
> > > > > > divide
> > > > > > > > > between theory and practice.
> > > > > > > > >
> > > > > > > > > * Alexander Saydakov: (*) Senior Software Engineer, Yahoo,
> > > > Sunnyvale,
> > > > > > > > > California. Interests: applied mathematics, computer
> science,
> > > big
> > > > > > data,
> > > > > > > > > distributed systems.
> > > > > > > > >
> > > > > > > > > === Introduction to Additional Interested Contributors ===
> > > > > > > > >
> > > > > > > > > These folks have been intermittently involved and
> > contributed,
> > > > but
> > > > > > are
> > > > > > > > > strong supporters of this project.
> > > > > > > > >
> > > > > > > > > * Frank Grimes: GitHub ID: frankgrimes97
> > > > > > > > >
> > > > > > > > > * Mina Ghashami: [mina.ghashami at gmail dot com] Ph.D.
> > > Computer
> > > > > > > Science,
> > > > > > > > > Univ of Utah. Interests: Machine Learning, Data Mining,
> > matrix
> > > > > > > > > approximation, streaming algorithms, randomized linear
> > algebra.
> > > > > > > > >
> > > > > > > > > * Christopher Musco: [christopher.musco at gmail dot com]
> > Ph.D.
> > > > > > > Computer
> > > > > > > > > Science, Research Instructor, Princeton University.
> > Interests:
> > > > > > > algorithmic
> > > > > > > > > foundations of data science and machine learning, efficient
> > > > methods
> > > > > > for
> > > > > > > > > processing and understanding large datasets, often working
> at
> > > the
> > > > > > > > > intersection of theoretical computer science, numerical
> > linear
> > > > > > > algebra, and
> > > > > > > > > optimization.
> > > > > > > > >
> > > > > > > > > * Graham Cormode: [g.cormode at warwick.ac dot uk] Ph.D.
> > > > Computer
> > > > > > > Science,
> > > > > > > > > Professor, Warwick University, Warwick, England. Interests:
> > all
> > > > > > > aspects of
> > > > > > > > > the "data lifecycle", from data collection and cleaning,
> > > through
> > > > > > > mining and
> > > > > > > > > analytics. (Professor Cormode is one of the world’s leading
> > > > > > scientists
> > > > > > > in
> > > > > > > > > sketching algorithms)
> > > > > > > > >
> > > > > > > > > === Alignment ===
> > > > > > > > >
> > > > > > > > > The DataSketches library already provides integrations and
> > > > example
> > > > > > > code for
> > > > > > > > > > Apache Hive, Apache Pig, and Apache Spark, and is deeply
> > integrated
> > > > into
> > > > > > > Apache
> > > > > > > > > Druid.
> > > > > > > > >
> > > > > > > > > == Known Risks ==
> > > > > > > > >
> > > > > > > > > The following subsections are specific risks that have been
> > > > > > identified
> > > > > > > by
> > > > > > > > > the ASF that need to be addressed.
> > > > > > > > >
> > > > > > > > > === Risk: Orphaned Products ===
> > > > > > > > >
> > > > > > > > > The DataSketches library is presently used by a number of
> > > > > > > organizations,
> > > > > > > > > from small startups to Fortune 100 companies, to construct
> > > > production
> > > > > > > > > pipelines that must process and analyze massive data. Yahoo
> > > has a
> > > > > > > long-term
> > > > > > > > > commitment to continue to advance the DataSketches library;
> > > > moreover,
> > > > > > > > > DataSketches is seeing increasing interest, development,
> and
> > > > adoption
> > > > > > > from
> > > > > > > > > many diverse organizations from around the world. Due to
> its
> > > > growing
> > > > > > > > > adoption, we feel it is quite unlikely that this project
> > would
> > > > become
> > > > > > > > > orphaned.
> > > > > > > > >
> > > > > > > > > === Risk: Inexperience with Open Source ===
> > > > > > > > >
> > > > > > > > > Yahoo believes strongly in open source and the exchange of
> > > > > > information
> > > > > > > to
> > > > > > > > > advance new ideas and work. Examples of this commitment are
> > > > active
> > > > > > open
> > > > > > > > > source projects such as those mentioned above. With
> > > > DataSketches, we
> > > > > > > have
> > > > > > > > > been increasingly open and forward-looking; we have
> > published a
> > > > > > number
> > > > > > > of
> > > > > > > > > papers about breakthrough developments in the science of
> > > > streaming
> > > > > > > > > algorithms (mentioned above) that also reference the
> > > DataSketches
> > > > > > > library.
> > > > > > > > > Our submission to the Apache Software Foundation is a
> logical
> > > > > > > extension of
> > > > > > > > > our commitment to open source software.
> > > > > > > > >
> > > > > > > > > Key committers at Yahoo with strong open source backgrounds
> > > > include
> > > > > > > Aaron
> > > > > > > > > Gresch, Alan Carroll, Alessandro Bellina, Anastasia
> > Braginsky,
> > > > > > Andrews
> > > > > > > > > Sahaya Albert, Arun S A G, Atul Mohan, Brad McMillen, Bryan
> > > Call,
> > > > > > Daryn
> > > > > > > > > Sharp, Dav Glass, David Carlin, Derek Dagit, Eric Payne,
> > Eshcar
> > > > > > Hillel,
> > > > > > > > > Ethan Li, Fei Deng, Francis Christopher Liu, Francisco
> > > > > > Perez-Sorrosal,
> > > > > > > Gil
> > > > > > > > > > Yehuda, Govind Menon, Hang Yang, Jacob Estelle, Jai Asher,
> > > James
> > > > > > > Penick,
> > > > > > > > > Jason Kenny, Jay Pipes, Jim Rollenhagen, Joe Francis, Jon
> > > Eagles,
> > > > > > > Kihwal
> > > > > > > > > Lee, Kishorkumar Patil, Koji Noguchi, Kuhu Shukla, Michael
> > > > Trelinski,
> > > > > > > > > Mithun Radhakrishnan, Nathan Roberts, Ohad Shacham, Olga L.
> > > > > > Natkovich,
> > > > > > > > > Parth Kamlesh Gandhi, Rajan Dhabalia, Rohini Palaniswamy,
> > Ruby
> > > > Loo,
> > > > > > > Ryan
> > > > > > > > > Bridges, Sanket Chintapalli, Satish Subhashrao Saley, Shu
> Kit
> > > > Chan,
> > > > > > Sri
> > > > > > > > > Harsha Mekala, Susan Hinrichs, Yonatan Gottesman, and many
> > > more.
> > > > > > > > >
> > > > > > > > > > All of our core developers are committed to learning about the
> > > > Apache
> > > > > > > process
> > > > > > > > > > and to giving back to the community.
> > > > > > > > >
> > > > > > > > > === Risk: Homogeneous Developers ===
> > > > > > > > >
> > > > > > > > > The majority of committers in this proposal belong to Yahoo
> > due
> > > > to
> > > > > > the
> > > > > > > fact
> > > > > > > > > that DataSketches has emerged from an internal Yahoo
> project.
> > > > This
> > > > > > > proposal
> > > > > > > > > also includes developers and contributors from other
> > companies,
> > > > and
> > > > > > > who are
> > > > > > > > > actively involved with other Apache projects, such as
> Druid.
> > > We
> > > > > > > expect our
> > > > > > > > > entry into incubation will allow us to expand the number of
> > > > > > > individuals and
> > > > > > > > > organizations participating in DataSketches development.
> > > > > > > > >
> > > > > > > > > === Risk: Reliance on Salaried Developers ===
> > > > > > > > >
> > > > > > > > > Because the DataSketches library originated within Yahoo,
> it
> > > has
> > > > been
> > > > > > > > > developed primarily by salaried Yahoo developers and we
> > expect
> > > > that
> > > > > > to
> > > > > > > > > continue to be the case near term. However, since we placed
> > > this
> > > > > > > library
> > > > > > > > > into open-source we have had a number of significant
> > > > contributions
> > > > > > from
> > > > > > > > > engineers and scientists from outside of Yahoo. We expect
> our
> > > > > > reliance
> > > > > > > on
> > > > > > > > > Yahoo salaried developers will decrease over time.
> > Nonetheless,
> > > > Yahoo
> > > > > > > is
> > > > > > > > > committed to continue its strong support of this important
> > > > project.
> > > > > > > > >
> > > > > > > > > === Risk: Lack of Relationship to other Apache Products ===
> > > > > > > > >
> > > > > > > > > DataSketches already directly interoperates with or
> utilizes
> > > > several
> > > > > > > > > existing Apache projects.
> > > > > > > > >
> > > > > > > > > * Build
> > > > > > > > >    * Apache Maven
> > > > > > > > >
> > > > > > > > > * Integrations and adaptors for the following projects
> > > naturally
> > > > have
> > > > > > > them
> > > > > > > > > as dependencies
> > > > > > > > >    * Apache Hive
> > > > > > > > >    * Apache Pig
> > > > > > > > >    * Apache Druid
> > > > > > > > >    * Apache Spark
> > > > > > > > >
> > > > > > > > > * Additional dependencies for the above integrations and
> > > adaptors
> > > > > > > include
> > > > > > > > >    * Apache Hadoop
> > > > > > > > >    * Apache Commons (Math)
> > > > > > > > >
> > > > > > > > > There is no other Apache project that we are aware of that
> > > > duplicates
> > > > > > > the
> > > > > > > > > functionality of the DataSketches library.
> > > > > > > > >
> > > > > > > > > === Risk: An Excessive Fascination with the Apache Brand
> ===
> > > > > > > > >
> > > > > > > > > With this proposal we are not seeking attention or
> publicity.
> > > > Rather,
> > > > > > > we
> > > > > > > > > firmly believe in the DataSketches library and concept and
> > the
> > > > > > ability
> > > > > > > to
> > > > > > > > > make the DataSketches library a powerful, yet simple-to-use
> > > > toolkit
> > > > > > for
> > > > > > > > > data processing. While the DataSketches library has been
> open
> > > > source,
> > > > > > > we
> > > > > > > > > believe putting code on GitHub can only go so far. We see
> the
> > > > Apache
> > > > > > > > > community, processes, and mission as critical for ensuring
> > the
> > > > > > > DataSketches
> > > > > > > > > library is truly community-driven, positively impactful,
> and
> > > > > > innovative
> > > > > > > > > open source software. While Yahoo has taken a number of
> steps
> > > to
> > > > > > > advance
> > > > > > > > > its various open source projects, we believe the
> DataSketches
> > > > library
> > > > > > > > > project is a great fit for the Apache Software Foundation
> due
> > > to
> > > > its
> > > > > > > focus
> > > > > > > > > on data processing and its relationships to existing ASF
> > > > projects.
> > > > > > > > >
> > > > > > > > > === Risk: Cryptography ===
> > > > > > > > >
> > > > > > > > > DataSketches does not contain any cryptographic code and is
> > > not a
> > > > > > > > > cryptographic product.
> > > > > > > > >
> > > > > > > > > == Documentation ==
> > > > > > > > >
> > > > > > > > > The following documentation is relevant to this proposal.
> > > > Relevant
> > > > > > > portions
> > > > > > > > > of the documentation will be contributed to the Apache
> > > > DataSketches
> > > > > > > > > project.
> > > > > > > > >
> > > > > > > > > * DataSketches website: https://datasketches.github.io.
> > > > > > > > >
> > > > > > > > > * DataSketches website repository:
> > > > > > > > > https://github.com/DataSketches/DataSketches.github.io
> > > > > > > > >
> > > > > > > > > We will need an apache website for this documentation
> similar
> > > to
> > > > > > > > >
> > > > > > > > > * https://datasketches.apache.org
> > > > > > > > >
> > > > > > > > > == Initial Source ==
> > > > > > > > >
> > > > > > > > > The initial source for DataSketches which we will submit to
> > the
> > > > > > Apache
> > > > > > > > > Foundation will include a number of repositories which are
> > > > currently
> > > > > > > hosted
> > > > > > > > > under the GitHub.com/datasketches organization:
> > > > > > > > >
> > > > > > > > > All github.com/datasketches repositories including:
> > > > > > > > >
> > > > > > > > > * Java
> > > > > > > > >    * sketches-core: This repository has the core sketching
> > > > classes,
> > > > > > > which
> > > > > > > > > are leveraged by some of the other repositories. This
> > > repository
> > > > has
> > > > > > no
> > > > > > > > > external dependencies outside of the DataSketches/memory
> > > > repository,
> > > > > > > Java
> > > > > > > > > and TestNG for unit tests. This code is versioned and the
> > > latest
> > > > > > > release
> > > > > > > > > can be obtained from Maven Central.
> > > > > > > > >    * memory: Low level, high-performance memory
> > data-structure
> > > > > > > management
> > > > > > > > > primarily for off-heap.
> > > > > > > > >    * sketches-android: This is a new repository dedicated
> to
> > > > sketches
> > > > > > > > > designed to be run in a mobile client, such as a cell
> phone.
> > It
> > > > is
> > > > > > > still in
> > > > > > > > > development and should be considered experimental.
> > > > > > > > >    * sketches-hive: This repository contains Hive UDFs and
> > > UDAFs
> > > > for
> > > > > > > use
> > > > > > > > > within Hadoop grid environments. This code has dependencies
> > on
> > > > > > > > > sketches-core as well as Hadoop and Hive. Users of this
> code
> > > are
> > > > > > > advised to
> > > > > > > > > use Maven to bring in all the required dependencies. This
> > code
> > > is
> > > > > > > versioned
> > > > > > > > > and the latest release can be obtained from Maven Central.
> > > > > > > > >    * sketches-pig: This repository contains Pig User
> Defined
> > > > > > Functions
> > > > > > > > > (UDF) for use within Hadoop grid environments. This code
> has
> > > > > > > dependencies
> > > > > > > > > on sketches-core as well as Hadoop and Pig. Users of this
> > code
> > > > are
> > > > > > > advised
> > > > > > > > > to use Maven to bring in all the required dependencies.
> This
> > > > code is
> > > > > > > > > versioned and the latest release can be obtained from Maven
> > > > Central.
> > > > > > > > >    * sketches-vector: This is a new repository dedicated to
> > > > sketches
> > > > > > > for
> > > > > > > > > vector and matrix operations. It is still somewhat
> > > experimental.
> > > > > > > > >    * characterization: This relatively new repository is
> for
> > > code
> > > > > > that
> > > > > > > we
> > > > > > > > > use to characterize the accuracy and speed performance of
> the
> > > > > > sketches
> > > > > > > in
> > > > > > > > > the library and is constantly being updated. Examples of
> the
> > > job
> > > > > > > command
> > > > > > > > > files used for various tests can be found in the
> > > > src/main/resources
> > > > > > > > > directory. Some of these tests can run for hours depending
> on
> > > their
> > > > > > > > > configuration.
> > > > > > > > >    * experimental: This repository is an experimental
> staging
> > > > area
> > > > > > for
> > > > > > > code
> > > > > > > > > that will eventually end up in another repository. This
> code
> > is
> > > > not
> > > > > > > > > versioned and not registered with Maven Central.
> > > > > > > > >    * sketches-misc: Demos and other code not related to
> > > > production
> > > > > > > > > deployment
> > > > > > > > >
> > > > > > > > > * C++ and Python
> > > > > > > > >    * sketches-core-cpp: This is the C++/Python companion to
> > the
> > > > Java
> > > > > > > > > sketches-core. These implementations are binary compatible
> > with
> > > > their
> > > > > > > > > counterparts in Java. In other words, a sketch created and
> > > > stored in
> > > > > > > C++
> > > > > > > > > > can be opened and read in Java and vice versa. This site
> also
> > > > has our
> > > > > > > > > Python adaptors that basically wrap the C++
> implementations,
> > > > making
> > > > > > the
> > > > > > > > > high performance C++ implementations available from Python.
> > > > > > > > >    * sketches-postgres: This site provides the
> > > postgres-specific
> > > > > > > adaptors
> > > > > > > > > that wrap the C++ implementations making them available to
> > the
> > > > > > Postgres
> > > > > > > > > database users.
> > > > > > > > >    * characterization-cpp: This is the C++/Python companion
> > to
> > > > the
> > > > > > Java
> > > > > > > > > characterization repository.
> > > > > > > > >    * experimental-cpp: This repository is an experimental
> > > staging
> > > > > > area
> > > > > > > for
> > > > > > > > > C++ code that will eventually end up in another repository.
> > > > > > > > >
> > > > > > > > > * Command-Line Tools
> > > > > > > > >    * sketches-cmd
> > > > > > > > >    * homebrew-sketches
> > > > > > > > >    * homebrew-sketches-cmd
> > > > > > > > >
> > > > > > > > > These projects have always been Apache 2.0 licensed. We
> > intend
> > > to
> > > > > > > bundle
> > > > > > > > > all of these repositories since they are all complementary
> > and
> > > > should
> > > > > > > be
> > > > > > > > > maintained in one project. Prior to our submission, we will
> > > > combine
> > > > > > > all of
> > > > > > > > > these projects into a new git repository.
> > > > > > > > >
> > > > > > > > > == Source and Intellectual Property Submission Plan ==
> > > > > > > > >
> > > > > > > > > Contributors to the DataSketches project have also signed
> the
> > > > Yahoo
> > > > > > > > > Individual Contributor License Agreement (
> > > > > > > > https://yahoocla.herokuapp.com/)
> > > > > > > > > in order to contribute to the project.
> > > > > > > > >
> > > > > > > > > With respect to trademark rights, Yahoo does not hold a
> > > > trademark on
> > > > > > > the
> > > > > > > > > phrase “DataSketches.” Based on feedback and guidance we
> > > receive
> > > > > > > during the
> > > > > > > > > incubation process, we are open to renaming the project if
> > > > necessary
> > > > > > > for
> > > > > > > > > trademark or other concerns, but we would prefer not to
> have
> > to
> > > > do
> > > > > > > that.
> > > > > > > > >
> > > > > > > > > == External Dependencies ==
> > > > > > > > >
> > > > > > > > > All external dependencies are licensed under an Apache 2.0
> or
> > > > > > > > > Apache-compatible license. As we grow the DataSketches
> > > community
> > > > we
> > > > > > > will
> > > > > > > > > > configure our build process to require and validate that all
> > > > contributions
> > > > > > > and
> > > > > > > > > dependencies are licensed under the Apache 2.0 license or
> are
> > > > under
> > > > > > an
> > > > > > > > > Apache-compatible license.
> > > > > > > > >
> > > > > > > > > == Required Resources ==
> > > > > > > > >
> > > > > > > > > === Mailing Lists ===
> > > > > > > > >
> > > > > > > > > We currently use a mix of mailing lists. We will migrate
> our
> > > > existing
> > > > > > > > > mailing lists to the following:
> > > > > > > > >
> > > > > > > > > * d...@datasketches.incubator.apache.org
> > > > > > > > >
> > > > > > > > > * u...@datasketches.incubator.apache.org
> > > > > > > > >
> > > > > > > > > * priv...@datasketches.incubator.apache.org
> > > > > > > > >
> > > > > > > > > * comm...@datasketches.incubator.apache.org
> > > > > > > > >
> > > > > > > > > === Source Control ===
> > > > > > > > >
> > > > > > > > > The DataSketches team currently uses Git and would like to
> > > > continue
> > > > > > to
> > > > > > > do
> > > > > > > > > so. We request a Git repository for DataSketches with
> > mirroring
> > > > to
> > > > > > > GitHub
> > > > > > > > > > enabled, similar to the following:
> > > > > > > > >
> > > > > > > > > * https://github.com/apache/incubator-datasketches.git
> > > > > > > > >
> > > > > > > > > === Issue Tracking ===
> > > > > > > > >
> > > > > > > > > We request the creation of an Apache-hosted JIRA. The
> > > > DataSketches
> > > > > > > project
> > > > > > > > > is currently using the public GitHub issue tracker and the
> > > public
> > > > > > > Google
> > > > > > > > > Groups forum/sketches-user for issue tracking and
> > discussions.
> > > We
> > > > > > will
> > > > > > > > > migrate and combine from these two sources to the Apache
> > JIRA.
> > > > > > > > >
> > > > > > > > > Proposed Jira ID: DATASKETCHES
> > > > > > > > >
> > > > > > > > > == Initial Committers ==
> > > > > > > > >
> > > > > > > > > The following list of individuals have been extremely
> active
> > in
> > > > our
> > > > > > > > > community and should have write (commit) permissions to the
> > > > > > repository.
> > > > > > > > >
> > > > > > > > > * Eshcar Hillel                      [eshcar at
> verizonmedia
> > > dot
> > > > com]
> > > > > > > > >
> > > > > > > > > * Kevin Lang                    [langk at verizonmedia dot
> > com]
> > > > > > > > >
> > > > > > > > > * Roman Leventov              [roman.leventov at
> > c.metamarkets
> > > > dot
> > > > > > com]
> > > > > > > > >
> > > > > > > > > * Edo Liberty                   [libertye at amazon dot
> com]
> > > > > > > > >
> > > > > > > > > * Jon Malkin                    [jmalkin at verizonmedia
> dot
> > > com]
> > > > > > > > >
> > > > > > > > > * Lee Rhodes                  [lrhodes at verizonmedia dot
> > > com] &
> > > > > > > [leerho
> > > > > > > > > at gmail dot com]
> > > > > > > > >
> > > > > > > > > * Alexander Saydakov         [saydakov at verizonmedia dot
> > com]
> > > > > > > > >
> > > > > > > > > * Justin Thaler                 [justin.thaler at
> georgetown
> > > dot
> > > > edu]
> > > > > > > > >
> > > > > > > > > == Affiliations ==
> > > > > > > > >
> > > > > > > > > The initial committers are from four organizations: Yahoo,
> > > > Amazon,
> > > > > > > > > Georgetown University, and Metamarkets/Snap.
> > > > > > > > >
> > > > > > > > > === Champion ===
> > > > > > > > > (Recommended to me: )
> > > > > > > > >
> > > > > > > > > Liang Chen, Vice President of Apache CarbonData,
> > [chenliang613
> > > at
> > > > > > > apache
> > > > > > > > > dot org]
> > > > > > > > > > Jean-Baptiste Onofré, [jb at nanthrax dot net]
> > > > > > > > >
> > > > > > > > > === Nominated Mentors ===
> > > > > > > > > (Recommended to me: )
> > > > > > > > >
> > > > > > > > > Liang Chen, Vice President of Apache CarbonData,
> > [chenliang613
> > > at
> > > > > > > apache
> > > > > > > > > dot org]
> > > > > > > > > Jean-Baptiste Onofré, jb at nanthrax dot net
> > > > > > > > > Gil Yehuda, gyehuda at verizonmedia dot com
> > > > > > > > >
> > > > > > > > > === Sponsoring Entity ===
> > > > > > > > >
> > > > > > > > > * The Apache Incubator    **** This is our 1st choice ****
> > > > > > > > >
> > > > > > > > > * Apache Druid. The incubating Apache Druid project might
> > also
> > > > be a
> > > > > > > logical
> > > > > > > > > sponsor. However, DataSketches has applications in many
> areas
> > > of
> > > > > > > computing
> > > > > > > > > outside of Druid so our preference and recommendation is
> that
> > > > > > > DataSketches
> > > > > > > > > would ultimately be a top-level Apache project.
> > > > > > > > >
> > > > > > > > > ________________
> > > > > > > > > [1] In 2017 Verizon acquired Yahoo and merged it with
> > > previously
> > > > > > > acquired
> > > > > > > > > AOL. The merged entity was originally called Oath, Inc.,
> but
> > > has
> > > > > > > recently
> > > > > > > > > been renamed Verizon Media, Inc., a wholly-owned subsidiary
> > of
> > > > > > Verizon,
> > > > > > > > > Inc.  Since Yahoo is the more recognized name, references
> in
> > > this
> > > > > > > document
> > > > > > > > > > to Yahoo are also references to Verizon Media, Inc.
> > > > > > > > >
> > > > > > > > > On Fri, Feb 22, 2019 at 9:35 PM Kenneth Knowles <
> > > k...@apache.org
> > > > >
> > > > > > > wrote:
> > > > > > > > >
> > > > > > > > > > The subject line has me interested already. Follow
> examples
> > > > like
> > > > > > this
> > > > > > > > > > maybe?
> > > > > > > > > >
> > > > > > > > > > 1.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/a5db74cc9e5ae89b3bfa5f4b07bfcc18dae84b7098232fb897cd47b7@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > > 2.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > >
> > > > > >
> > > >
> > >
> >
> https://lists.apache.org/thread.html/5a7f6a218b11a1cac61fbd53f4c995fd7716f8ad3751cf9f171ebd57@%3Cgeneral.incubator.apache.org%3E
> > > > > > > > > >
> > > > > > > > > > Kenn
> > > > > > > > > >
> > > > > > > > > > On Fri, Feb 22, 2019 at 8:05 PM leerho <lee...@gmail.com
> >
> > > > wrote:
> > > > > > > > > >
> > > > > > > > > > > I'll try again ... :)
> > > > > > > > > > >
> > > > > > > > > > > On Fri, Feb 22, 2019 at 8:00 PM Ted Dunning <
> > > > > > ted.dunn...@gmail.com
> > > > > > > >
> > > > > > > > > > wrote:
> > > > > > > > > > >
> > > > > > > > > > >> It didn't make it again
> > > > > > > > > > >>
> > > > > > > > > > >> On Fri, Feb 22, 2019, 8:35 PM leerho <
> lee...@gmail.com>
> > > > wrote:
> > > > > > > > > > >>
> > > > > > > > > > >> > I'm not sure the attached document made it through.
> > > > > > > > > > >> >
> > > > > > > > > > >> > On Fri, Feb 22, 2019 at 7:28 PM leerho <
> > > lee...@gmail.com>
> > > > > > > wrote:
> > > > > > > > > > >> >
> > > > > > > > > > >> > >
> > > > > > > > > > >> > >
> > > > > > > > > > >> >
> > > > > > > > > > >>
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > >
> > > ---------------------------------------------------------------------
> > > > > > > > > > > To unsubscribe, e-mail:
> > > > general-unsubscr...@incubator.apache.org
> > > > > > > > > > > For additional commands, e-mail:
> > > > > > general-h...@incubator.apache.org
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > > >
> > > > > >
> > > > > --
> > > > > From my cell phone.
> > > > >
> > > >
> > > >
> > > >
> > >
> >
>
-- 
From my cell phone.
