This is an automated email from the ASF dual-hosted git repository.
git-site-role pushed a commit to branch asf-site
in repository
https://gitbox.apache.org/repos/asf/incubator-datasketches-website.git
The following commit(s) were added to refs/heads/asf-site by this push:
new 4fc0d6f Automatic Site Publish by Buildbot
4fc0d6f is described below
commit 4fc0d6ff58217c848be04fd45c4d9d5934fa0891
Author: buildbot <[email protected]>
AuthorDate: Sat Aug 22 06:42:56 2020 +0000
Automatic Site Publish by Buildbot
---
output/docs/Community/KDD_Tutorial_Summary.html | 13 +-
.../img/Community/KDD_sketching_tutorial_pt1.pdf | Bin 0 -> 15455179 bytes
.../img/Community/KDD_sketching_tutorial_pt2.pdf | Bin 0 -> 754785 bytes
.../docs/img/Community/KLL_Sketch_Tutorial.ipynb | 518 +++++++++++++++++++++
.../docs/img/Community/Theta_Sketch_Tutorial.ipynb | 329 +++++++++++++
output/docs/img/Community/Untitled.ipynb | 493 ++++++++++++++++++++
6 files changed, 1352 insertions(+), 1 deletion(-)
diff --git a/output/docs/Community/KDD_Tutorial_Summary.html
b/output/docs/Community/KDD_Tutorial_Summary.html
index 836eec6..69106b0 100644
--- a/output/docs/Community/KDD_Tutorial_Summary.html
+++ b/output/docs/Community/KDD_Tutorial_Summary.html
@@ -57,7 +57,7 @@
under the License.
-->
-<h1 id="data-sketching-for-real-time-analyticstheory-and-practice">Data
Sketching for Real Time Analytics:<br />Theory and Practice</h1>
+<h1 id="data-sketching-for-real-time-analytics-theory-and-practice">Data
Sketching for Real Time Analytics: Theory and Practice</h1>
<h2 id="abstract">Abstract</h2>
@@ -71,6 +71,17 @@
<p>The audience is expected to have a familiarity with probability and
statistics that is typical for an undergraduate mathematical statistics or
introductory graduate machine learning course.</p>
+<h2 id="materials">Materials</h2>
+
+<p>In addition to the prerecorded presentations, the slides and Jupyter
notebooks are available. Note that the KLL notebook uses an update method that
is only available in release candidate v2.1.0; as of the tutorial date it has
not yet appeared in an official release (the latest is 2.0.0).</p>
+
+<ul>
+ <li>Slides: <a
href="/docs/img/Community/KDD_sketching_tutorial_pt1.pdf">Theory (part
1)</a></li>
+ <li>Slides: <a
href="/docs/img/Community/KDD_sketching_tutorial_pt2.pdf">Practice (part
2)</a></li>
+  <li>Notebook: <a href="/docs/img/Community/KLL_Sketch_Tutorial.ipynb">KLL
Sketch</a></li>
+  <li>Notebook: <a
href="/docs/img/Community/Theta_Sketch_Tutorial.ipynb">Theta Sketch</a></li>
+</ul>
+
<h2 id="outline">Outline</h2>
<p>The tutorial will consist of two parts. The first focuses on methods and
theory for data sketching and sampling. The second focuses on application and
includes code examples using the Apache DataSketches project.</p>
diff --git a/output/docs/img/Community/KDD_sketching_tutorial_pt1.pdf
b/output/docs/img/Community/KDD_sketching_tutorial_pt1.pdf
new file mode 100644
index 0000000..a078c47
Binary files /dev/null and
b/output/docs/img/Community/KDD_sketching_tutorial_pt1.pdf differ
diff --git a/output/docs/img/Community/KDD_sketching_tutorial_pt2.pdf
b/output/docs/img/Community/KDD_sketching_tutorial_pt2.pdf
new file mode 100644
index 0000000..3990c27
Binary files /dev/null and
b/output/docs/img/Community/KDD_sketching_tutorial_pt2.pdf differ
diff --git a/output/docs/img/Community/KLL_Sketch_Tutorial.ipynb
b/output/docs/img/Community/KLL_Sketch_Tutorial.ipynb
new file mode 100644
index 0000000..418ce84
--- /dev/null
+++ b/output/docs/img/Community/KLL_Sketch_Tutorial.ipynb
@@ -0,0 +1,518 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# KLL Sketch Tutorial"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Table of Contents\n",
+ "\n",
+ " * [Overview](#Overview)\n",
+ " * [Set-up](#Set-up)\n",
+ " * [Creating a KLL Sketch](#Creating-a-KLL-Sketch)\n",
+ " * [Querying the sketch](#Querying-the-sketch)\n",
+ " * [Merging Sketches](#Merging-Sketches)\n",
+ " * [Serializing Sketches for
Transportation](#Serializing-Sketches-for-Transportation)\n",
+ " * [Using in a Data Cube](#Using-in-a-Data-Cube)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Overview\n",
+ "\n",
+    "This tutorial will focus on the KLL sketch. We will demonstrate how to
create and feed data into sketches and show an option for moving sketches
between systems. We will rely on synthetic data to help us better reason about
expected results when visualizing."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Set-up\n",
+ "\n",
+    "This tutorial assumes you have already downloaded and installed the
Python wrapper for the DataSketches library. See the [DataSketches
Downloads](http://datasketches.apache.org/docs/Community/Downloads.html) page
for details."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from datasketches import kll_floats_sketch\n",
+ "\n",
+ "import base64\n",
+ "import numpy as np\n",
+ "import pandas as pd\n",
+ "\n",
+ "import matplotlib.pyplot as plt\n",
+ "import seaborn as sns\n",
+ "%matplotlib inline\n",
+ "sns.set(color_codes=True)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {
+ "slideshow": {
+ "slide_type": "fragment"
+ }
+ },
+ "source": [
+ "### Creating a KLL Sketch\n",
+ "\n",
+ "Sketch creation is simple: As with all the sketches in the library, you
simply need to decide on your error tolerance, which determines the maximum
size of the sketch. The DataSketches library refers to that value as $k$.\n",
+ "\n",
+ "We can get an estimate of the expected error bound (99th percentile)
without instantiating anything."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(kll_floats_sketch.get_normalized_rank_error(160, False))\n",
+ "print(kll_floats_sketch.get_normalized_rank_error(200, False))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "As we can see, the (one-sided) error with $k=160$ is about 1.67% versus
1.33% at $k=200$. For the rest of the examples, we will use $200$. We can now
instantiate a sketch."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "k = 200\n",
+ "sk = kll_floats_sketch(k)\n",
+ "print(sk)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The sketch has seen no data so far (N=0) and is consequently storing
nothing (Retained items=0). Storage bytes refers to how much space would be
required to save the sketch as an array of bytes, which in this case is fairly
minimal.\n",
+ "Next, we can add some data."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sk.update(np.random.exponential(size=150))\n",
+ "print(sk)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We added 150 samples, which is few enough that the sketch is still in
exact mode, meaning it is storing everything rather than sampling. To be able
to compare the sketch to an exact computation, we will generate new data -- and
a lot more of it. We will also create a sketch with a much larger $k$ to
demonstrate the effect of increasing the sketch size."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sk = kll_floats_sketch(k)\n",
+ "sk_large = kll_floats_sketch(10*k)\n",
+ "data = np.random.exponential(size=2**24)\n",
+ "sk.update(data)\n",
+ "sk_large.update(data)\n",
+ "print(sk)\n",
+ "print(sk_large)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "Here the sketch is well into sampling territory, having processed nearly
17 million items while retaining only 645 of them. The 2676 bytes of storage
compares to 64MB for the raw data as 4-byte floats, and even the much larger
sketch uses less than 24KB with fewer than 6000 retained points. Next we will
start querying the sketch to better understand its performance."
+ ]
+ },
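The 64MB figure above is straightforward arithmetic, independent of the sketch library; a quick plain-Python check of the raw-data size and the implied compression ratio (using the 2676-byte sketch size quoted above):

```python
# Raw storage for 2**24 single-precision floats (4 bytes each)
n = 2**24
raw_bytes = n * 4                      # 67,108,864 bytes = 64 MiB
raw_mib = raw_bytes // (1024 * 1024)

# Approximate compression ratio versus the k=200 sketch size quoted above
sketch_bytes = 2676
ratio = raw_bytes / sketch_bytes
print(raw_mib, round(ratio))
```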
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Querying the sketch\n",
+ "\n",
+    "The median of an exponential distribution is $\\frac{\\ln 2}{\\lambda}$,
and the default numpy exponential distribution has $\\lambda = 1.0$, so the
median should be close to $0.693$. Similarly, if we ask for the rank of $\\ln
2$, we should get a value close to $0.5$."
+ ]
+ },
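The $\frac{\ln 2}{\lambda}$ claim can be sanity-checked with the standard library alone; this rough simulation (seeded, no sketching involved) compares an empirical median against the closed form:

```python
import math
import random
import statistics

random.seed(42)
lam = 1.0
# Median of Exp(lambda) is ln(2)/lambda; check empirically
samples = [random.expovariate(lam) for _ in range(100_000)]
empirical = statistics.median(samples)
theoretical = math.log(2) / lam
print(empirical, theoretical)
```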
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(f'Theoretical median : {np.log(2):.6f}')\n",
+ "print(f'Estimated median, k=200 : {sk.get_quantile(0.5):.6f}')\n",
+ "print(f'Estimated median, k=2000 :
{sk_large.get_quantile(0.5):.6f}')\n",
+ "print('')\n",
+ "print(f'Exact Quantile of ln(2) : 0.5')\n",
+ "print(f'Est. Quantile of ln(2), k=200 :
{sk.get_rank(np.log(2)):.6f}')\n",
+ "print(f'Est. Quantile of ln(2), k=2000 :
{sk_large.get_rank(np.log(2)):.6f}')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "One of the common use cases of a quantiles sketch like KLL is visualizing
data with a histogram. We can create one from the sketch easily enough, but for
this tutorial we also want to know how well we are doing. Fortunately, we can
still compute a histogram on this data directly for comparison.\n",
+ "\n",
+ "Note that the sketch returns a PMF while the histogram computes data only
for the bins between the provided points and must be converted to a PMF. The
sketch also returns a bin containing all the mass less than the minimum
provided point. In this case that will always be 0, so we discard it for
plotting."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "xmin = 0 # could use sk.get_min_value() but we know the bound here\n",
+ "xmax = sk.get_quantile(0.99995) # this will exclude a little data from
the exact distribution\n",
+ "num_splits = 40\n",
+ "step = (xmax - xmin) / num_splits\n",
+ "splits = [xmin + (i*step) for i in range(0, num_splits)]\n",
+ "x = splits.copy()\n",
+ "x.append(xmax)\n",
+ "\n",
+ "pmf = sk.get_pmf(splits)[1:]\n",
+ "pmf_large = sk_large.get_pmf(splits)[1:]\n",
+ "exact_pmf = np.histogram(data, bins=x)[0] / sk.get_n()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plt.figure(figsize=(12,12))\n",
+ "plt.subplot(2,1,1)\n",
+ "plt.title('PMF, k = 200')\n",
+ "plt.ylabel('Probability')\n",
+ "plt.bar(x=splits, height=pmf, align='edge', width=-.07, color='blue')\n",
+ "plt.bar(x=splits, height=exact_pmf, align='edge', width=.07,
color='red')\n",
+ "plt.legend(['KLL, k=200','Exact'])\n",
+ "\n",
+ "plt.subplot(2,1,2)\n",
+ "plt.title('PMF, k = 2000')\n",
+ "plt.ylabel('Probability')\n",
+ "plt.bar(x=splits, height=pmf_large, align='edge', width=-.07,
color='blue')\n",
+ "plt.bar(x=splits, height=exact_pmf, align='edge', width=.07,
color='red')\n",
+ "plt.legend(['KLL, k=2000','Exact'])"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "The sketch with $k=200$ clearly provides a good approximation. In the
case of this exponential distribution, we sometimes observe that there is
additional mass near the right edge of the tail compared to the true PMF,
although still within the provided error bound with high probability. While
this is not problematic given the guarantees of the sketch, certain use cases
requiring high precision at extreme quantiles may find it less satisfactory.
With the larger $k=2000$ sketch, the a [...]
+ "\n",
+    "We will eventually provide what we call a Relative Error Quantiles sketch
that will have tighter error bounds as you approach the tail of the
distribution, which will be useful if you care primarily about accuracy in the
tail, but that will require a larger sketch."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "### Merging Sketches\n",
+ "\n",
+ "A single sketch is certainly useful, but the real power of sketches comes
from the ability to merge them. Here, we will create two simple sketches to
demonstrate. For good measure, we'll use different values of $k$ for the
sketches, as well as feed them different numbers of points."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sk1 = kll_floats_sketch(k)\n",
+ "sk2 = kll_floats_sketch(int(1.5 * k))\n",
+ "\n",
+ "data1 = np.random.normal(loc=-2.0, size=2**24)\n",
+ "data2 = np.random.normal(loc=2.0, size=2**25)\n",
+ "sk1.update(data1)\n",
+ "sk2.update(data2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "With the KLL sketch, there is no separate object for unions. We can
either create another empty sketch and use that as a merge target or we can
merge sketch 2 into sketch 1. Taking the latter approach and plotting the
resulting histogram gives us the expected distribution. Note that one sketch
has twice as many points as the other."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sk1.merge(sk2)\n",
+ "print(sk1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We saved the input data so that we can again compute an exact
distribution."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "xmin = sk1.get_min_value()\n",
+ "xmax = sk1.get_max_value()\n",
+ "num_splits = 20\n",
+ "step = (xmax - xmin) / num_splits\n",
+ "splits = [xmin + (i*step) for i in range(0, num_splits)]\n",
+ "x = splits.copy()\n",
+ "x.append(xmax)\n",
+ "\n",
+ "pmf = sk1.get_pmf(splits)[1:]\n",
+ "cdf = sk1.get_cdf(splits)[1:]\n",
+ "exact_pmf = (np.histogram(data1, bins=x)[0] + np.histogram(data2,
bins=x)[0]) / sk1.get_n()"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "plt.figure(figsize=(12,6))\n",
+ "plt.subplot(1,2,1)\n",
+ "plt.bar(x=splits, height=pmf, align='edge', width=-.3, color='blue')\n",
+ "plt.bar(x=splits, height=exact_pmf, align='edge', width=.3,
color='red')\n",
+ "plt.legend(['KLL','Exact'])\n",
+ "plt.ylabel('Probability')\n",
+ "plt.title('Merged PMF')\n",
+ "\n",
+ "plt.subplot(1,2,2)\n",
+ "plt.bar(x=splits, height=cdf, align='edge', width=-.3, color='blue')\n",
+ "plt.bar(x=splits, height=np.cumsum(exact_pmf), align='edge', width=.3,
color='red')\n",
+ "plt.legend(['KLL','Exact'])\n",
+ "plt.ylabel('Probability')\n",
+ "plt.title('Merged CDF')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Notice that we do not need to do anything special to merge the sketches
despite the different values of $k$, and the 2:1 relative ratio of weights of
the two distributions was preserved despite the input sketch size difference."
+ ]
+ },
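The 2:1 weighting falls out of the stream lengths rather than the sketch sizes; the mixture weights the merged sketch should reflect are just the relative counts:

```python
n1 = 2**24   # points fed to sk1 (the loc=-2.0 normal)
n2 = 2**25   # points fed to sk2 (the loc=+2.0 normal)
w1 = n1 / (n1 + n2)   # weight of the first distribution in the merged sketch
w2 = n2 / (n1 + n2)   # weight of the second
print(w1, w2)
```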
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Serializing Sketches for Transportation\n",
+ "\n",
+ "Being able to move sketches between platforms is important. One of the
useful aspects of the DataSketches library in particular is binary
compatibility across languages. While this section will remain within python,
sketches serialized from C++- or Java-based systems would work identically.\n",
+ "\n",
+    "In this section, we will start by creating a tab-separated file with a
handful\n",
+    "of sketches and then load it in as a dataframe. We will encode each
binary sketch image as base64.\n",
+ "\n",
+ "To simplify sketch creation, the first step will be to define a simple
function to generate a line for the file with the given parameters."
+ ]
+ },
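The base64 step is independent of the sketch library; the encode/decode round-trip used below works the same way for any byte string (stdlib only, with a stand-in payload instead of `sk.serialize()`):

```python
import base64

# Stand-in for sk.serialize(): any bytes round-trip the same way
payload = bytes(range(16))
encoded = base64.b64encode(payload).decode('utf-8')  # safe to embed in a TSV field
decoded = base64.b64decode(encoded)
print(encoded)
```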
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "def generate_sketch(family: str, n: int, mean: float, var: float) ->
str:\n",
+ " sk = kll_floats_sketch(200)\n",
+ " if (family == 'normal'):\n",
+ " sk.update(np.random.normal(loc=mean, scale=var, size=n))\n",
+ " elif (family == 'uniform'):\n",
+ " b = mean + np.sqrt(3 * var)\n",
+ " a = 2 * mean - b\n",
+ " sk.update(np.random.uniform(low=a, high=b, size=n))\n",
+ " else:\n",
+ " return None\n",
+ " sk_b64 = base64.b64encode(sk.serialize()).decode('utf-8')\n",
+ " return f'{family}\\t{n}\\t{mean}\\t{var}\\t{sk_b64}\\n'"
+ ]
+ },
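The `uniform` branch above solves for the interval endpoints from the requested mean and variance: for $U(a,b)$ the mean is $(a+b)/2$ and the variance $(b-a)^2/12$, which gives $b = \mu + \sqrt{3\sigma^2}$ and $a = 2\mu - b$. A quick stdlib check of that algebra, using the parameters of one of the uniform rows written below:

```python
import math

mean, var = 0.5, 1.0 / 12   # parameters of one of the uniform rows
b = mean + math.sqrt(3 * var)
a = 2 * mean - b
# Recover mean and variance of U(a, b) from the endpoints
recovered_mean = (a + b) / 2
recovered_var = (b - a) ** 2 / 12
print(a, b, recovered_mean, recovered_var)
```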
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "filename = 'kll_tutorial.tsv'\n",
+ "with open(filename, 'w') as f:\n",
+ " f.write('family\\tn\\tmean\\tvariance\\tkll\\n')\n",
+ " f.write(generate_sketch('normal', 2**23, -4.0, 0.5))\n",
+ " f.write(generate_sketch('normal', 2**24, 0.0, 1.0))\n",
+ " f.write(generate_sketch('normal', 2**25, 2.0, 0.5))\n",
+ " f.write(generate_sketch('normal', 2**23, 4.0, 0.2))\n",
+ " f.write(generate_sketch('normal', 2**22, -2.0, 2.0))\n",
+ " f.write(generate_sketch('uniform', 2**21, 0.5, 1.0/12))\n",
+ " f.write(generate_sketch('uniform', 2**22, 5.0, 1.0/12))\n",
+ " f.write(generate_sketch('uniform', 2**20, -0.5, 1.0/3))\n",
+ " f.write(generate_sketch('uniform', 2**23, 0.0, 4.0/3))\n",
+ " f.write(generate_sketch('uniform', 2**22, -4.0, 1.0/3))
"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "If your system has a *nix shell, you can inspect the resulting file:"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+    "!head -2 {filename}"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Using in a Data Cube"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Now that we have our file with 10 sketches, we can use pandas to load
them in. To ensure that we load the sketches as useful objects, we need to
define a converter function."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "deserialize_kll = lambda x :
kll_floats_sketch.deserialize(base64.b64decode(x))\n",
+ "\n",
+ "df = pd.read_csv(filename,\n",
+ " sep='\\t',\n",
+ " header=0,\n",
+ " dtype={'family':'category', 'n':int, 'mean':float,
'var':float},\n",
+ " converters={'kll':deserialize_kll}\n",
+ " )\n",
+ "print(df)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "The sketch column is displayed via each sketch's string summary, which is
not very useful for viewing here but does show that the column contains the
actual sketch objects."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "And finally, we can now perform queries on the results."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "query_result = kll_floats_sketch(10*k)\n",
+ "for sk in df.loc[df['family'] == 'normal'].itertuples(index=False):\n",
+ " query_result.merge(sk.kll)\n",
+ "print(query_result)\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "Here we see that the resulting sketch has processed 71 million items
(272MB of data), is summarizing them with only 563 retained items, and can be
serialized into only 2352 bytes, which includes some sketch metadata.\n",
+ "\n",
+ "Finally, we want to visualize this data. Remember that we have a mixture
of 5 Gaussian distributions:\n",
+ "\n",
+ "| $\\mu$ | $\\sigma^2$ | n |\n",
+ "|-----:|----:|---------:|\n",
+ "| -4.0 | 0.5 | $2^{23}$ |\n",
+ "| 0.0 | 1.0 | $2^{24}$ |\n",
+ "| 2.0 | 0.5 | $2^{25}$ |\n",
+ "| 4.0 | 0.2 | $2^{23}$ |\n",
+ "| -2.0 | 2.0 | $2^{22}$ |"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "xmin = query_result.get_quantile(0.0005)\n",
+ "xmax = query_result.get_quantile(0.9995)\n",
+ "num_splits = 50\n",
+ "step = (xmax - xmin) / num_splits\n",
+ "splits = [xmin + (i*step) for i in range(0, num_splits)]\n",
+ "\n",
+ "pmf = query_result.get_pmf(splits)\n",
+ "x = splits.copy()\n",
+ "x.append(xmax)\n",
+ "plt.figure(figsize=(12,6))\n",
+ "plt.title('PMF')\n",
+ "plt.ylabel('Probability')\n",
+ "plt.bar(x=x, height=pmf, align='edge', width=-0.15)"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/output/docs/img/Community/Theta_Sketch_Tutorial.ipynb
b/output/docs/img/Community/Theta_Sketch_Tutorial.ipynb
new file mode 100644
index 0000000..28c6157
--- /dev/null
+++ b/output/docs/img/Community/Theta_Sketch_Tutorial.ipynb
@@ -0,0 +1,329 @@
+{
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Theta Sketch Tutorial\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Table of Contents\n",
+ "\n",
+ " * [Overview](#Overview)\n",
+ " * [Set-up](#Set-up)\n",
+ " * [Basic Sketch Usage](#Basic-Sketch-Usage)\n",
+ " * [Sketch Unions](#Sketch-Unions)\n",
+ " * [Sketch Intersections](#Sketch-Intersections)\n",
+ " * [Set Difference (A-not-B)](#Set-Difference-(A-not-B))"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Overview\n",
+ "\n",
+ "This tutorial covers basic operation of the Theta sketch for distinct
counting. We will demonstrate how to create and feed data into sketches as well
as the various set operations. We will also include the HLL sketch for
comparison.\n",
+ "\n",
+    "Characterization tests of the hash function we use, Murmur3, have shown
that it has excellent independence properties. As a result, we can achieve
reasonable performance for demonstration purposes by feeding in sequential
integers. This lets us experiment with the set operations in a controlled but
still realistic manner, and know the exact result without resorting to an
expensive computation."
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Set-up\n",
+ "\n",
+    "This tutorial assumes you have already downloaded and installed the
Python wrapper for the DataSketches library. See the [DataSketches
Downloads](http://datasketches.apache.org/docs/Community/Downloads.html) page
for details."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from datasketches import theta_sketch, update_theta_sketch,
compact_theta_sketch\n",
+ "from datasketches import theta_union, theta_intersection,
theta_a_not_b\n",
+ "\n",
+ "from datasketches import hll_sketch, hll_union"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Basic Sketch Usage"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "To start, we'll create a sketch with ~1 million points in order to
demonstrate basic sketch operations."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "n = 2**20\n",
+ "k = 12\n",
+ "sk1 = update_theta_sketch(k)\n",
+ "hll1 = hll_sketch(k)\n",
+ "for i in range(0, n):\n",
+ " sk1.update(i)\n",
+ " hll1.update(i)\n",
+ "print(sk1)\n",
+ "print(hll1)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "The summary contains most of the data of interest, but we can also query
for specific values. In this case, since we know the exact number of distinct
items presented to the sketch, we can express the estimate and the upper and
lower bounds as a percentage of the exact value."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "print(f'Exact result:\\t\\t\\t{n}')\n",
+ "print('')\n",
+ "print(f'Theta upper bound (1 std.
dev):\\t{sk1.get_upper_bound(1):.1f}\\t({100*sk1.get_upper_bound(1) / n -
100:.2f}%)')\n",
+ "print(f'Theta
estimate:\\t\\t\\t{sk1.get_estimate():.1f}\\t({100*sk1.get_estimate() / n -
100:.2f}%)')\n",
+ "print(f'Theta lower bound (1 std.
dev):\\t{sk1.get_lower_bound(1):.1f}\\t({100*sk1.get_lower_bound(1) / n -
100:.2f}%)')\n",
+ "print('')\n",
+ "print(f'HLL upper bound (1 std.
dev):\\t{hll1.get_upper_bound(1):.1f}\\t({100*hll1.get_upper_bound(1) / n -
100:.2f}%)')\n",
+ "print(f'HLL
estimate:\\t\\t\\t{hll1.get_estimate():.1f}\\t({100*hll1.get_estimate() / n -
100:.2f}%)')\n",
+ "print(f'HLL lower bound (1 std.
dev):\\t{hll1.get_lower_bound(1):.1f}\\t({100*hll1.get_lower_bound(1) / n -
100:.2f}%)')\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can serialize and reconstruct the sketch. If we compact the sketch
prior to serialization, we can still query the rebuilt sketch but cannot update
it further. When reconstructed, we can see that the estimate is exactly the
same."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "sk1_bytes = sk1.compact().serialize()\n",
+ "len(sk1_bytes)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "new_sk1 = theta_sketch.deserialize(sk1_bytes)\n",
+ "print(f'Estimate (original):\\t{sk1.get_estimate()}')\n",
+ "print(f'Estimate (new):\\t\\t{new_sk1.get_estimate()}')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Sketch Unions"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Theta Sketch unions make use of a separate union object. The union will
accept input sketches with different values of $k$.\n",
+ "\n",
+ "For this example, we will create a sketch with distinct values that
partially overlap those in `sk1`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "offset = int(15 * n / 16)\n",
+ "sk2 = update_theta_sketch(k+1)\n",
+ "hll2 = hll_sketch(k+1)\n",
+ "for i in range(0, n):\n",
+ " sk2.update(i + offset)\n",
+ " hll2.update(i + offset)\n",
+ "print(sk2)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "We can now feed the sketches into the union. As constructed, the exact
number of unique values presented to the two sketches is $\\frac{31}{16}n$."
+ ]
+ },
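The $\frac{31}{16}n$ figure follows from how the two integer ranges overlap; it can be verified exactly with plain Python sets (no sketching, no estimation error):

```python
n = 2**20
offset = 15 * n // 16
a = set(range(n))                    # values fed to sk1
b = set(range(offset, offset + n))   # values fed to sk2, shifted by 15n/16
exact_union = len(a | b)
print(exact_union, 31 * n // 16)
```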
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "union = theta_union(k)\n",
+ "union.update(sk1)\n",
+ "union.update(sk2)\n",
+ "\n",
+ "union_hll = hll_union(k)\n",
+ "union_hll.update(hll1)\n",
+ "union_hll.update(hll2)\n",
+ "\n",
+    "exact = int(31 * n / 16)\n",
+ "result = union.get_result()\n",
+ "theta_bound_pct = 100 * (result.get_upper_bound(1) -
result.get_estimate()) / exact\n",
+ "\n",
+ "hll_result = union_hll.get_result()\n",
+ "hll_bound_pct = 100 * (hll_result.get_upper_bound(1) -
hll_result.get_estimate()) / exact\n",
+ "\n",
+ "\n",
+ "print(f'Exact result:\\t{exact}')\n",
+ "print(f'Theta Estimate:\\t{result.get_estimate():.1f}
({100*(result.get_estimate()/exact - 1):.2f}% +- {theta_bound_pct:.2f}%)')\n",
+ "print(f'HLL Estimate:\\t{hll_result.get_estimate():.1f}
({100*(hll_result.get_estimate()/exact - 1):.2f}% +- {hll_bound_pct:.2f}%)')\n"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Sketch Intersections"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "Beyond unions, Theta sketches also support intersections through the use
of an intersection object. For comparison, we also present the HLL estimate
here using the inclusion-exclusion formula: $|A \\cup B| = |A| + |B| -
|A \\cap B|$.\n",
+    "\n",
+    "That formula might not seem too bad when intersecting 2 sketches, but as
the number of sketches increases the formula becomes increasingly complex, and
the error compounds rapidly. By comparison, the Theta set operations can be
applied to an arbitrary number of sketches."
+ ]
+ },
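Rearranged for the intersection, the identity reads $|A \cap B| = |A| + |B| - |A \cup B|$; checked exactly on the same two integer ranges with plain sets:

```python
n = 2**20
offset = 15 * n // 16
a = set(range(n))
b = set(range(offset, offset + n))
# |A ∩ B| = |A| + |B| - |A ∪ B|
via_ie = len(a) + len(b) - len(a | b)
direct = len(a & b)
print(via_ie, direct, n // 16)
```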
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "intersection = theta_intersection()\n",
+ "intersection.update(sk1)\n",
+ "intersection.update(sk2)\n",
+ "\n",
+ "hll_inter_est = hll1.get_estimate() + hll2.get_estimate() -
hll_result.get_estimate()\n",
+ "\n",
+ "print(\"Has result: \", intersection.has_result())\n",
+ "result = intersection.get_result()\n",
+ "print(result)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "In this case, we expect the sets to have an overlap of $\\frac{1}{16}n$."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "exact = int(n / 16)\n",
+ "theta_bound_pct = 100 * (result.get_upper_bound(1) -
result.get_estimate()) / exact\n",
+ "\n",
+ "print(f'Exact result:\\t\\t{exact}')\n",
+ "print(f'Theta Estimate:\\t\\t{result.get_estimate():.1f}
({100*(result.get_estimate()/exact - 1):.2f}% +- {theta_bound_pct:.2f}%)')\n",
+ "print(f'HLL Estimate:\\t\\t{hll_inter_est:.1f} ({100*(hll_inter_est/exact
- 1):.2f}%)')"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Set Difference (A-not-B)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+    "Finally, we have the set difference operation. Unlike `theta_union` and
`theta_intersection`, `theta_a_not_b` is currently stateless: the object takes
two input sketches at a time, namely $a$ and $b$, and directly returns the
result as a sketch."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "anb = theta_a_not_b()\n",
+ "result = anb.compute(sk1, sk2)\n",
+ "print(result)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "By using the same two sketches as before, the expected result here is
$\\frac{15}{16}n$.\n",
+ "\n",
+ "Our HLL estimate comes from manipulating the Inclusion-Exclusion formula
above to obtain $|A| - |A \\cap B| = |A \\cup B| - |B|$."
+ ]
+ },
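The rearranged identity used for the HLL comparison, $|A| - |A \cap B| = |A \cup B| - |B|$, verified exactly on the same integer ranges with plain sets:

```python
n = 2**20
offset = 15 * n // 16
a = set(range(n))
b = set(range(offset, offset + n))
a_not_b = len(a - b)                  # direct set difference
via_identity = len(a | b) - len(b)    # |A ∪ B| - |B|
print(a_not_b, via_identity, 15 * n // 16)
```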
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+    "exact = int(15 * n / 16)\n",
+ "theta_bound_pct = 100 * (result.get_upper_bound(1) -
result.get_estimate()) / exact\n",
+ "hll_diff_est = hll_result.get_estimate() - hll2.get_estimate()\n",
+ "\n",
+ "print(f'Exact result:\\t{exact}')\n",
+ "print(f'Theta estimate:\\t{result.get_estimate():.1f}
({100*(result.get_estimate()/exact -1):.2f}% +- {theta_bound_pct:.2f}%)')\n",
+ "print(f'HLL estimate:\\t{hll_diff_est:.1f} ({100*(hll_diff_est/exact -
1):.2f}%)')"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
diff --git a/output/docs/img/Community/Untitled.ipynb
b/output/docs/img/Community/Untitled.ipynb
new file mode 100644
index 0000000..2eab621
--- /dev/null
+++ b/output/docs/img/Community/Untitled.ipynb
@@ -0,0 +1,493 @@
+{
+ "cells": [
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from datasketches import *\n",
+ "import pandas as pd\n",
+ "import base64"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "deserialize_kll = lambda x :
kll_floats_sketch.deserialize(base64.b64decode(x))\n",
+ "\n",
+ "df = pd.read_csv(\"1hr.kll.k140.txt\",\n",
+ " sep=\"\\t\",\n",
+ " header=None,\n",
+ " names=['pty','device','kll'],\n",
+ " dtype={'pty':'category', 'device':'category'},\n",
+ " converters={'kll':deserialize_kll}\n",
+ " )"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/html": [
+ "<div>\n",
+ "<style scoped>\n",
+ " .dataframe tbody tr th:only-of-type {\n",
+ " vertical-align: middle;\n",
+ " }\n",
+ "\n",
+ " .dataframe tbody tr th {\n",
+ " vertical-align: top;\n",
+ " }\n",
+ "\n",
+ " .dataframe thead th {\n",
+ " text-align: right;\n",
+ " }\n",
+ "</style>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ " <thead>\n",
+ " <tr style=\"text-align: right;\">\n",
+ " <th></th>\n",
+ " <th>pty</th>\n",
+ " <th>device</th>\n",
+ " <th>kll</th>\n",
+ " </tr>\n",
+ " </thead>\n",
+ " <tbody>\n",
+ " <tr>\n",
+ " <th>0</th>\n",
+ " <td>mail</td>\n",
+ " <td>mobile</td>\n",
+ " <td>### KLL sketch summary:\\n K : 1...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>1</th>\n",
+ " <td>mail</td>\n",
+ " <td>desktop</td>\n",
+ " <td>### KLL sketch summary:\\n K : 1...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>2</th>\n",
+ " <td>news</td>\n",
+ " <td>mobile</td>\n",
+ " <td>### KLL sketch summary:\\n K : 1...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>3</th>\n",
+ " <td>news</td>\n",
+ " <td>desktop</td>\n",
+ " <td>### KLL sketch summary:\\n K : 1...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>4</th>\n",
+ " <td>sports</td>\n",
+ " <td>mobile</td>\n",
+ " <td>### KLL sketch summary:\\n K : 1...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>5</th>\n",
+ " <td>sports</td>\n",
+ " <td>desktop</td>\n",
+ " <td>### KLL sketch summary:\\n K : 1...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>6</th>\n",
+ " <td>finance</td>\n",
+ " <td>mobile</td>\n",
+ " <td>### KLL sketch summary:\\n K : 1...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>7</th>\n",
+ " <td>finance</td>\n",
+ " <td>desktop</td>\n",
+ " <td>### KLL sketch summary:\\n K : 1...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>8</th>\n",
+ " <td>front-page</td>\n",
+ " <td>mobile</td>\n",
+ " <td>### KLL sketch summary:\\n K : 1...</td>\n",
+ " </tr>\n",
+ " <tr>\n",
+ " <th>9</th>\n",
+ " <td>front-page</td>\n",
+ " <td>desktop</td>\n",
+ " <td>### KLL sketch summary:\\n K : 1...</td>\n",
+ " </tr>\n",
+ " </tbody>\n",
+ "</table>\n",
+ "</div>"
+ ],
+ "text/plain": [
+ "          pty   device                                                kll\n",
+ "0        mail   mobile  ### KLL sketch summary:\\n K : 1...\n",
+ "1        mail  desktop  ### KLL sketch summary:\\n K : 1...\n",
+ "2        news   mobile  ### KLL sketch summary:\\n K : 1...\n",
+ "3        news  desktop  ### KLL sketch summary:\\n K : 1...\n",
+ "4      sports   mobile  ### KLL sketch summary:\\n K : 1...\n",
+ "5      sports  desktop  ### KLL sketch summary:\\n K : 1...\n",
+ "6     finance   mobile  ### KLL sketch summary:\\n K : 1...\n",
+ "7     finance  desktop  ### KLL sketch summary:\\n K : 1...\n",
+ "8  front-page   mobile  ### KLL sketch summary:\\n K : 1...\n",
+ "9  front-page  desktop  ### KLL sketch summary:\\n K : 1..."
+ ]
+ },
+ "execution_count": 5,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "df"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 6,
+ "metadata": {},
+ "outputs": [
+ {
+ "name": "stdout",
+ "output_type": "stream",
+ "text": [
+ "### KLL sketch summary:\n",
+ " K : 140\n",
+ " min K : 140\n",
+ " M : 8\n",
+ " N : 8651479\n",
+ " Epsilon : 1.88%\n",
+ " Epsilon PMF : 2.31%\n",
+ " Empty : false\n",
+ " Estimation mode: true\n",
+ " Levels : 16\n",
+ " Sorted : false\n",
+ " Capacity items : 466\n",
+ " Retained items : 345\n",
+ " Storage bytes : 1472\n",
+ " Min value : 0\n",
+ " Max value : 5.38e+03\n",
+ "### End sketch summary\n",
+ "\n"
+ ]
+ }
+ ],
+ "source": [
+ "query_result = kll_floats_sketch(140)\n",
+ "for sk in df.loc[df['pty'] != 'news'].itertuples(index=False):\n",
+ " query_result.merge(sk.kll)\n",
+ "print(query_result)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 7,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "#xmin = query_result.get_min_value()\n",
+ "#xmax = query_result.get_max_value()\n",
+ "xmin = 0.001\n",
+ "xmax = query_result.get_quantile(0.95)\n",
+ "num_splits = 50\n",
+ "step = (xmax - xmin) / num_splits\n",
+ "splits = [xmin + (i*step) for i in range(0, num_splits)]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 8,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pmf = query_result.get_pmf(splits)\n",
+ "x = splits.copy()\n",
+ "x.append(xmax)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import seaborn as sns\n",
+ "import numpy as np\n",
+ "import matplotlib.pyplot as plt\n",
+ "%matplotlib inline\n",
+ "sns.set(color_codes=True)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "<BarContainer object of 51 artists>"
+ ]
+ },
+ "execution_count": 10,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png":
"iVBORw0KGgoAAAANSUhEUgAAAXwAAAD7CAYAAABpJS8eAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAbx0lEQVR4nO3dfUyU9wEH8O/BgXqDFmHP3QwxNZubrlPmWhIYM1gr58nLgVqaUVxvzopvXW1JteLLAtK6GqribNFUbLtEccMq4ugc0m6ZSwZZwXWCae3qWjsreofALIeAJzz7w3jxehzPHdzLY3/fT2Li8/59Hsj3nvsdPGhkWZZBRERfe2GhDkBERMHBwiciEgQLn4hIECx8IiJBsPCJiATBwiciEgQLn4hIENpQBxhJd3cvhob8/2sCcXFR6Oy0+32//qT2jGrPB6g/I/ONndozBjtf
[...]
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "fig, ax = plt.subplots()\n",
+ "plt.bar(x=x, height=pmf, align='edge', width=-50)\n",
+ "#ax.set_xscale('log')"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "num_splits = 50\n",
+ "logstep = (np.log10(xmax) - np.log10(1.0)) / num_splits\n",
+ "logsplits = [np.log10(1.0) + (i*logstep) for i in range(0, num_splits)]"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "pmf = query_result.get_pmf(np.power(10, logsplits))\n",
+ "logx = list(np.power(10, logsplits))\n",
+ "logx.append(xmax)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "<BarContainer object of 51 artists>"
+ ]
+ },
+ "execution_count": 13,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png":
"iVBORw0KGgoAAAANSUhEUgAAAXwAAAD7CAYAAABpJS8eAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAd0klEQVR4nO3df1Cb9QEG8CdYoKV0w3Jvsslpvc5Nhi1j0zuR3eGwTdNSXoOF3mo7o+tKW89Kx3kIjvaqWCtXcZwV7Xls6q0LG5TryOJ5KVtPu7vBzcKc6LV2Mn+spS4JpG4FQ0mad3/0mmsM4c1vot/n8xfvj7x58g33kPu+5H01iqIoICKir7y0uQ5ARETJwcInIhIEC5+ISBAsfCIiQbDwiYgEwcInIhIEC5+ISBDz5jrAbM6fn4TPl5ivCeTmZmN8fCIhx44H5otdqmdkvtilesZk
[...]
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "fig, ax = plt.subplots()\n",
+ "plt.bar(x=logx, height=pmf, align='edge', width=-50)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "kll = kll_floats_sketch(140)\n",
+ "for i in range(0,100000):\n",
+ " kll.update(np.random.exponential())\n",
+ "xmin = kll.get_min_value()\n",
+ "xmax = kll.get_max_value()\n",
+ "step = (xmax - xmin) / 50\n",
+ "splits = [xmin + (i*step) for i in range(0,50)]\n",
+ "x = splits.copy()\n",
+ "x.append(xmax)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "<BarContainer object of 51 artists>"
+ ]
+ },
+ "execution_count": 15,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png":
"iVBORw0KGgoAAAANSUhEUgAAAYIAAAD7CAYAAABnoJM0AAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3df1DT9/0H8GeAQGGko7Akcm7X9qYnh4reLhuUtbBOaeRHGop4U+iix8QfndXmWiqtWoTD+WMotqtQS3u7a9UpqzZZehix7by1wt2Q1Zqbtlfv1tapTcKPKqGgIfl8//Dbz0oBEyAQ4uf5uPOOz+f9ziev1+jy5PP+JJ/IBEEQQEREkhUW7AKIiCi4GARERBLHICAikjgGARGRxDEIiIgkjkFARCRxDAIiIomLCHYBY9Hd3QuvN7Aff0hIiEVnpyugxww29hQa2FNoCOWe
[...]
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "pmf = kll.get_pmf(splits)\n",
+ "plt.bar(x=x,height=pmf,align='edge',width=-.25)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "x = list(range(-10, 11))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "y = np.multiply(x,x)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "[<matplotlib.lines.Line2D at 0x7f933a602760>]"
+ ]
+ },
+ "execution_count": 18,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png":
"iVBORw0KGgoAAAANSUhEUgAAAXkAAAD7CAYAAACPDORaAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAgAElEQVR4nO3deVxU59028GuGGYZ9n2GTXRA3FkUiLqBxYRM0aBITU5sYo2lTk6aNjTFN0yZNtKnPx6RvHu2TpbFNYlNXXIK4gyhGBRFcUFH2fd+HYZb7/cNIg6Iyw8ycWX7ff1pmznCu3OjF8Zxz34fHGGMghBBikvhcByCEEKI7VPKEEGLCqOQJIcSEUckTQogJo5InhBATRiVPCCEmjEqeEEJMmIDrAPdqa+uBSqX+rfuurnZoaenWQaKRoVzqoVzqM9RslEs9mubi83lwdrZ94PsG
[...]
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "plt.plot(x,y)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 67,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "kll = kll_floats_sketch(160)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 68,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "kll.update(np.random.poisson(lam=2.0,size=2**20))\n",
+ "kll.update(np.random.poisson(lam=20.0,size=2**22))"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 75,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "<BarContainer object of 61 artists>"
+ ]
+ },
+ "execution_count": 75,
+ "metadata": {},
+ "output_type": "execute_result"
+ },
+ {
+ "data": {
+ "image/png":
"iVBORw0KGgoAAAANSUhEUgAAAXwAAAD7CAYAAABpJS8eAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAADh0RVh0U29mdHdhcmUAbWF0cGxvdGxpYiB2ZXJzaW9uMy4yLjIsIGh0dHA6Ly9tYXRwbG90bGliLm9yZy+WH4yJAAAaiUlEQVR4nO3df0wb5x0G8MfkB4XFjBWdvRSt7bZMYWnD0IRWiiZHmUKcAFcQpBpqJi+LRtdqC5vVsdIA3ZI1JY3ovHTpoop2ndrBCqEtlqvMoEbrVg2UDZQmVKFRmZYuYZltIE2A2oDD7Y8otzoY7hxsDH6fz1/cvefz917sx8fruxeDoigKiIgo4SXFuwAiIlocDHwiIkEw8ImIBMHAJyISBAOfiEgQDHwiIkEw8ImIBLEy3gXM5/LlCczMLPw2gYyMNRgZGY9CRcsX++A69gP74IZE
[...]
+ "text/plain": [
+ "<Figure size 432x288 with 1 Axes>"
+ ]
+ },
+ "metadata": {},
+ "output_type": "display_data"
+ }
+ ],
+ "source": [
+ "xmin = kll.get_min_value()\n",
+ "xmax = kll.get_max_value()\n",
+ "num_steps = 60\n",
+ "step = (xmax - xmin) / num_steps\n",
+ "splits = [xmin + (i*step) for i in range(0,num_steps)]\n",
+ "x = splits.copy()\n",
+ "x.append(xmax)\n",
+ "pmf = kll.get_pmf(splits)\n",
+ "plt.bar(x=x,height=pmf,align='edge',width=-.5)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": null,
+ "metadata": {},
+ "outputs": [],
+ "source": []
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "Python 3",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.8.2"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]