[incubator-datasketches-website] branch asf-site updated: Automatic Site Publish by Buildbot

git-site-role Fri, 24 Jan 2020 18:13:23 -0800

This is an automated email from the ASF dual-hosted git repository.

git-site-role pushed a commit to branch asf-site
in repository 
https://gitbox.apache.org/repos/asf/incubator-datasketches-website.git



The following commit(s) were added to refs/heads/asf-site by this push:
     new 1a58e2a  Automatic Site Publish by Buildbot
1a58e2a is described below

commit 1a58e2a614e3cb69d2d1f405220468daa1927e36
Author: buildbot <[email protected]>
AuthorDate: Sat Jan 25 02:13:02 2020 +0000

    Automatic Site Publish by Buildbot
---
 output/docs/MajorSketchFamilies.html | 60 +++++++++++++++++++++++++++++++-----
 1 file changed, 52 insertions(+), 8 deletions(-)

diff --git a/output/docs/MajorSketchFamilies.html 
b/output/docs/MajorSketchFamilies.html
index 4f4e3f2..2e47681 100644
--- a/output/docs/MajorSketchFamilies.html
+++ b/output/docs/MajorSketchFamilies.html
@@ -464,6 +464,11 @@
     specific language governing permissions and limitations
     under the License.
 -->
+<h1 id="cardinality-sketches">Cardinality Sketches</h1>
+
+<h2 
id="cpc-sketch-estimating-stream-cardinalities-more-efficiently-than-the-famous-hll-sketch">CPC
 Sketch: Estimating Stream Cardinalities more efficiently than the famous HLL 
sketch!</h2>
+<p>This sketch was developed by the late Keven Lang, our chief scientist at 
the time, is an amazing <em>tour de force</em> of scientific design and 
engineering and has substantially better accuracy / per stored size than the 
famous HLL sketch. The theory and demonstration of its performance is detailed 
in Lang’s paper <a href="https://arxiv.org/abs/1708.06839";>Back to the Future: 
an Even More Nearly Optimal Cardinality Estimation Algorithm</a>.  This sketch 
is available in Java, C++ and  [...]
+
 <h2 id="theta-sketches-estimating-stream-expression-cardinalities"><a 
href="/docs/Theta/ThetaSketchFramework.html">Theta Sketches</a>: Estimating 
Stream Expression Cardinalities</h2>
 <p>Internet content, search and media companies like Yahoo, Google, Facebook, 
etc., collect many tens of billions of event records from the many millions of 
users to their web sites each day.  These events can be classified by many 
different dimensions, such as the page visited and user location and profile 
information.  Each event also contains some unique identifiers associated with 
the user, specific device (cell phone, tablet, or computer) and the web browser 
used.</p>
 
@@ -474,31 +479,70 @@
 <p>Computing cardinalities with massive data requires lots of computer 
resources and time.
 However, if an approximate answer to these problems is acceptable, <a 
href="/docs/Theta/ThetaSketchFramework.html">Theta Sketches</a> can provide 
reasonable estimates, in a single pass, orders of magnitude faster, even fast 
enough for analysis in near-real time.</p>
 
+<p>The <a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/theta/Sketch.java";>theta/Sketch</a>
 can operate both on-heap and off-heap, has powerful Union, Intersection, AnotB 
and Jaccard operators, has a high-performance concurrent form for 
multi-threaded environments, has both immutable compact, and updatable 
representations, and is quite fast. It is available in Java, C++ and Python. 
Because of its flexibility, it is one of th [...]
+
+<h2 
id="tuple-sketches-extending-theta-sketches-to-perform-associative-analysis"><a 
href="/docs/Tuple/TupleOverview.html">Tuple Sketches</a>: Extending Theta 
Sketches to Perform Associative Analysis</h2>
+<p>It is often not enough to perform stream expressions on sets of unique 
identifiers, it is very valuable to be able to associate additive data with 
these identifiers, such as impression counts, clicks or timestamps.  Tuple 
Sketches are a natural extension of the Theta sketch and have Java Genric 
forms, that enable the user do define the sketch with arbitrary “summary” data. 
 The <a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasket
 [...]
+
+<p>The Tuple sketch is effectively infinitely extendable and there are several 
common variants of the Tuple Sketch, which also serve as examples on how to 
extend the base classes, that are also in the library, including:</p>
+
+<ul>
+  <li><a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/tuple/adouble/DoubleSketch.java";>tuple/adouble/DoubleSketch</a>
 with a single column of <em>double</em> values as the <em>summary</em>.</li>
+  <li><a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/tuple/aninteger/IntegerSketch.java";>tuple/aninteger/IntegerSketch</a>
 with a single column of <em>int</em> values as the <em>summary</em>.</li>
+  <li><a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/tuple/strings/ArrayOfStringsSketch.java";>tuple/strings/ArrayOfStringsSketch</a>,
 which is effectively a variable number of columns of strings as the 
<em>summary</em>.</li>
+  <li><a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/tuple/ArrayOfDoublesSketch.java";>tuple/ArrayOfDoublesSketch</a>,
 which is effectively a variable number of columns of double values as the 
<em>summary</em>. This variant also provides both on-heap and off-heap 
operation.</li>
+</ul>
+
 <h2 id="hyperloglog-sketches-estimating-stream-cardinalities"><a 
href="/docs/HLL/HLL.html">HyperLogLog Sketches</a>: Estimating Stream 
Cardinalities</h2>
 <p>The HyperLogLog (HLL) is a cardinality sketch similar to the above Theta 
sketches except they are anywhere from 2 to 16 times smaller in size.  The HLL 
sketches can be Unioned, but set intersection and difference operations are not 
provided intrinsically, because the resulting error would be quite poor.  If 
your application only requires cardinality estimation and Unioning and space is 
at a premium, the HLL sketch provided could be your best choice.</p>
 
+<p>The <a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/hll/HllSketch.java";>hll/HllSketch</a>
 can operate both on-heap and off-heap, provides the Union operators, and has 
both immutable compact, and updatable representations. It is available in Java, 
C++ and Python.</p>
+
 <h2 
id="hyperloglog-map-sketch-estimating-stream-cardinalities-of-key-value-pairs"><a
 href="/docs/HLL/HllMap.html">HyperLogLog Map Sketch</a>: Estimating Stream 
Cardinalities of Key-Value Pairs</h2>
-<p>This is a specially designed sketch that addresses the problem of 
individually tracking value cardinalities of Key-Value (K,V) pairs, where the 
number of keys can be very large, such as IP addresses, or Geo identifiers, 
etc. Assigning individual sketches to each key would create unnecessary 
overhead. This sketch streamlines the process with much better space 
management.</p>
+<p>This is a specially designed sketch that addresses the problem of 
individually tracking value cardinalities of Key-Value (K,V) pairs in 
real-time, where the number of keys can be very large, such as IP addresses, or 
Geo identifiers, etc. Assigning individual sketches to each key would create 
unnecessary overhead. This sketch streamlines the process with much better 
space management.  This <a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apac
 [...]
+
+<h1 id="quantiles-sketches">Quantiles Sketches</h1>
 
 <h2 
id="quantiles-sketches-estimating-distributions-from-a-stream-of-values"><a 
href="/docs/Quantiles/QuantilesOverview.html">Quantiles Sketches</a>: 
Estimating Distributions from a Stream of Values</h2>
 <p>There are many situations where is valuable to understand the distribution 
of values in a stream. For example, from a stream of web-page time-spent 
values, it would be useful to know arbitrary quantiles of the distribution, 
such as the 25th percentile value, the median value and the 75th percentile 
value. The <a href="/docs/Quantiles/QuantilesOverview.html">Quantiles 
Sketches</a> solve this problem and enable the inverse functions such as the 
Probability Mass Function (PMF) and the Cu [...]
 
 <p><img class="doc-img-full" src="/docs/img/quantiles/TimeSpentHistogram.png" 
alt="TimeSpentHistogram" /></p>
 
-<h2 
id="frequent-items-sketches-finding-the-heavy-hitter-objects-from-a-stream"><a 
href="/docs/FrequentItems/FrequentItemsOverview.html">Frequent Items 
Sketches</a>: Finding the Heavy Hitter Objects from a Stream</h2>
+<p>There are two different families of quantiles sketches, the original <a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/quantiles/DoublesSketch.java";>quantiles/DoublesSketch</a>,
 which can be operated either on-heap or off-heap, and is also available in a 
Java Generic form for arbitrary comparable objects. It is only available in 
Java.</p>
+
+<p>Later we developed the <a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/kll/KllFloatsSketch.java";>kll/KllFloatsSketch</a>
  (Named after its authors), which is also a quantiles sketch, that achieves 
near optimal small size for a given accuracy. It is only available on-heap. It 
is available in Java, C++ and Python.</p>
+
+<h1 id="frequent-items--heavy-hitters-sketches">Frequent Items / Heavy Hitters 
Sketches</h1>
+
+<h2 
id="frequent-items-sketches-finding-the-heavy-hitter-objects-from-a-stream"><a 
href="/docs/Frequency/FrequentItemsOverview.html">Frequent Items Sketches</a>: 
Finding the Heavy Hitter Objects from a Stream</h2>
 <p>It is very useful to be able to scan a stream of objects, such as song 
titles, and be able to quickly identify those items that occur most frequently. 
 The term <i>Heavy Hitter</i> is defined to be an item that occurs more 
frequently than some fractional share of the overall count of items
 in the stream including duplicates.  Suppose you have a stream of 1M song 
titles, but in that stream there are only 100K song titles that are unique. If 
any single title consumes more than 10% of the stream elements it is a Heavy 
Hitter, and the 10% is a threshold parameter we call epsilon.</p>
 
-<p>The accuracy of a Frequent Items Sketch is proportional to the configured 
size of the sketch, the larger the sketch, the smaller is the epsilon threshold 
that can detect Heavy Hitters.</p>
+<p>The accuracy of a Frequent Items Sketch is proportional to the configured 
size of the sketch, the larger the sketch, the smaller is the epsilon threshold 
that can detect Heavy Hitters. This sketch is available in two forms, as the <a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/frequencies/LongsSketch.java";>frequencies/LongsSketch</a>
 used for processing a stream of tuples {<em>long</em>, weight}, and the <a 
href="https: [...]
 
-<h2 
id="tuple-sketches-extending-theta-sketches-to-perform-associative-analysis"><a 
href="/docs/Tuple/TupleOverview.html">Tuple Sketches</a>: Extending Theta 
Sketches to Perform Associative Analysis</h2>
-<p>It is often not enough to perform stream expressions on sets of unique 
identifiers, it is very valuable to be able to associate additive data with 
these identifiers, such as impression counts or clicks.  Tuple Sketches are a 
recent addition to the library and can be extended with arbitrary “summary” 
data.</p>
+<h2 
id="frequent-distinct-tuples-sketch-finding-the-heavy-hitter-tuples-from-a-stream"><a
 href="/docs/Frequency/FrequentDistinctTuplesSketch.html">Frequent Distinct 
Tuples Sketch</a>: Finding the Heavy Hitter tuples from a Stream.</h2>
+<p>Suppose our data is a stream of pairs {IP address, User ID} and we want to 
identify the IP addresses that have the most distinct User IDs. Or conversely, 
we would like to identify the User IDs that have the most distinct IP 
addresses. This is a common challenge in the analysis of big data and the FDT 
sketch helps solve this problem using probabilistic techniques.</p>
 
-<h2 
id="sampling-sketches-uniform-sampling-of-a-stream-into-a-fixed-size-space"><a 
href="/docs/Sampling/ReservoirSampling.html">Sampling Sketches</a>: Uniform 
Sampling of a Stream into a fixed size space</h2>
-<p>This implements the famous Reservoir sampling algorithm and extends it with 
the capabilities that large-scale distributed systems really need: mergability 
(even with different sized sketches), uses Java Generics so that the base 
classes can be trivially extended for any input type (even polymorphic types), 
and an extensible means of performing serialization and deserialization. VarOpt 
sampling extends the family to weighted sampling, additionally providing subset 
sum estimates from th [...]
+<p>More generally, given a multiset of tuples with <em>N</em> dimensions 
<em>{d1,d2, d3, …, dN}</em>, and a primary subset of dimensions <em>M &lt; 
N</em>, our task is to identify the combinations of <em>M</em> subset 
dimensions that have the most frequent number of distinct combinations of the 
<em>N - M</em> non-primary dimensions.</p>
+
+<p>The <a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/fdt/FdtSketch.java";>fdt/FdtSketch</a>
 is currently only available in Java, but because it is an extension of the 
Tuple Sketch family, it inherits many of the same properties: it can operate 
both on-heap and off-heap, includes both Union and Intersection operators, has 
both immutable compact, and updatable representations.</p>
 
 <h2 
id="frequent-directions-distributed-mergeable-singular-value-decomposition">Frequent
 Directions: Distributed, mergeable Singular Value Decomposition</h2>
-<p>Part of a new separate sketches-vector package, Frequent Directions is in 
many ways a generalization of the Frequent Items sketch to handle vector data. 
This sketch computes an approximate singular value decomposition (SVD) of a 
matrix, providing a projection matrix that can be used for dimensionality 
reduction. SVD is a key technique in many recommender systems, providing 
shopping suggestions based on a customer’s past purchases compared with other 
similar customers.</p>
+<p>Part of a new separate sketches-vector package, Frequent Directions is in 
many ways a generalization of the Frequent Items sketch to handle vector data. 
This sketch computes an approximate singular value decomposition (SVD) of a 
matrix, providing a projection matrix that can be used for dimensionality 
reduction. SVD is a key technique in many recommender systems, providing 
shopping suggestions based on a customer’s past purchases compared with other 
similar customers. This sketch is s [...]
+
+<h1 id="sampling-sketches">Sampling Sketches</h1>
+
+<h2 
id="sampling-sketches-uniform-sampling-of-a-stream-into-a-fixed-size-space"><a 
href="/docs/Sampling/ReservoirSampling.html">Sampling Sketches</a>: Uniform 
Sampling of a Stream into a fixed size space</h2>
+<p>This family of sketches implements an enhanced version of the famous 
Reservoir sampling algorithm and extends it with the capabilities that 
large-scale distributed systems really need: mergability (even with different 
sized sketches), uses Java Generics so that the base classes can be trivially 
extended for any input type (even polymorphic types), and an extensible means 
of performing serialization and deserialization.</p>
+
+<p>The <a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/sampling/ReservoirLongsSketch.java";>sampling/ReservoirLongsSketch</a>
 accepts a stream of <em>long</em> values as identifiers with a weight of one, 
and produces a result Reservoir of a pre-determined size that represents a 
uniform random sample of the stream.</p>
+
+<p>The <a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/sampling/ReservoirItemsSketch.java";>sampling/ReservoirItemsSketch</a>
 accepts a stream of type <em>T</em> as identifiers with a weight of one, and 
produces a result Reservoir of a pre-determined size that represents a uniform 
random sample of the stream.</p>
+
+<p>The <a 
href="https://github.com/apache/incubator-datasketches-java/blob/master/src/main/java/org/apache/datasketches/sampling/VarOptItemsSketch.java";>sampling/VarOptItemsSketch</a>
 extends the Reservoir family to weighted sampling, additionally providing 
subset sum estimates from the sample with provably optimal variance.</p>
+
+<p>This family is currently only available in Java.</p>
+
 
       </div>
     </div>


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[incubator-datasketches-website] branch asf-site updated: Automatic Site Publish by Buildbot

Reply via email to