[2/2] incubator-madlib git commit: Sketch: Promote sketch methods to top-level

riyer Thu, 08 Jun 2017 13:31:48 -0700

Sketch: Promote sketch methods to top-level

JIRA: MADLIB-1120


This commit fixes some of the documentation for sketch and moves the
module out of "Early stage development".

Closes #139


Project: http://git-wip-us.apache.org/repos/asf/incubator-madlib/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-madlib/commit/edc9b9f3
Tree: http://git-wip-us.apache.org/repos/asf/incubator-madlib/tree/edc9b9f3
Diff: http://git-wip-us.apache.org/repos/asf/incubator-madlib/diff/edc9b9f3

Branch: refs/heads/master
Commit: edc9b9f358eb0dcf880a6cb38b7fd4dd404be61d
Parents: e05efaa
Author: Rahul Iyer <ri...@apache.org>
Authored: Tue Jun 6 16:09:30 2017 -0700
Committer: Rahul Iyer <ri...@apache.org>
Committed: Thu Jun 8 13:30:57 2017 -0700

----------------------------------------------------------------------
 doc/mainpage.dox.in                    |  14 ++--
 methods/sketch/src/pg_gp/sketch.sql_in | 121 ++++++++++++----------------
 2 files changed, 59 insertions(+), 76 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/incubator-madlib/blob/edc9b9f3/doc/mainpage.dox.in
----------------------------------------------------------------------
diff --git a/doc/mainpage.dox.in b/doc/mainpage.dox.in
index 8b894b1..407946e 100644
--- a/doc/mainpage.dox.in
+++ b/doc/mainpage.dox.in
@@ -148,6 +148,14 @@ complete matrix stored as a distributed table.
     @ingroup grp_stats
     @{A collection of methods to compute descriptive statistics of the dataset 
@}
 
+        @defgroup grp_sketches Cardinality Estimators
+        @ingroup grp_desc_stats
+        @{
+            @defgroup grp_countmin CountMin (Cormode-Muthukrishnan)
+            @defgroup grp_fmsketch FM (Flajolet-Martin)
+            @defgroup grp_mfvsketch MFV (Most Frequent Values)
+        @}
+
         @defgroup grp_correlation Pearson's Correlation
         @ingroup grp_desc_stats
 
@@ -254,12 +262,6 @@ complete matrix stored as a distributed table.
 There may be some issues that will be addressed in a future version.
 Interface and implementation are subject to change.
 @{
-     @defgroup grp_sketches Cardinality Estimators
-     @{
-        @defgroup grp_countmin CountMin (Cormode-Muthukrishnan)
-        @defgroup grp_fmsketch FM (Flajolet-Martin)
-        @defgroup grp_mfvsketch MFV (Most Frequent Values)
-     @}
 
     @defgroup grp_cg Conjugate Gradient
     @defgroup grp_bayes Naive Bayes Classification

http://git-wip-us.apache.org/repos/asf/incubator-madlib/blob/edc9b9f3/methods/sketch/src/pg_gp/sketch.sql_in
----------------------------------------------------------------------
diff --git a/methods/sketch/src/pg_gp/sketch.sql_in 
b/methods/sketch/src/pg_gp/sketch.sql_in
index a5b7120..2ae36a2 100644
--- a/methods/sketch/src/pg_gp/sketch.sql_in
+++ b/methods/sketch/src/pg_gp/sketch.sql_in
@@ -15,23 +15,19 @@ m4_include(`SQLCommon.m4')
 @addtogroup grp_sketches
 @brief   A collection of methods to estimate the number of unique values 
contained in the data.
 
-\warning <em> These MADlib methods are still in early stage development. There 
may be some 
-issues that will be addressed in future versions. Interface and implementation
-is subject to change. </em>
-
 Sketches (sometimes called "synopsis data structures") are small randomized
 in-memory data structures that capture statistical properties of a large set
 of values (e.g., a column of a table).  Sketches can be formed in a single
 pass of the data, and used to approximate a variety of descriptive statistics.
 
-We implement sketches as SQL User-Defined Aggregates (UDAs). Because they
-are single-pass, small-space and parallelized, a single query can
-use many sketches to gather summary statistics on many columns of a table 
efficiently.
+We implement sketches as SQL User-Defined Aggregates (UDAs). Because they are
+single-pass, small-space and parallelized, a single query can use many sketches
+to gather summary statistics on many columns of a table efficiently.
 
 This module currently implements user-defined aggregates based on three main 
sketch methods:
  - <i>Count-Min (CM)</i> sketches, which can be used to approximate a number 
of descriptive statistics including
-   - <c>COUNT(*)</c> of rows whose column value matches a given value in a set
-   - <c>COUNT(*)</c> of rows whose column value falls in a range (*)
+   - <c>COUNT</c> of rows whose column value matches a given value in a set
+   - <c>COUNT</c> of rows whose column value falls in a range (*)
    - order statistics including <i>median</i> and <i>centiles</i> (*)
    - <i>histograms</i>: both <i>equi-width</i> and <i>equi-depth</i> (*)
  - <i>Flajolet-Martin (FM)</i> sketches for approximating 
<c>COUNT(DISTINCT)</c>.
@@ -57,11 +53,6 @@ and UDFs (user-defined functions), to be used directly in 
SQL queries.
 </ul>
 </div>
 
-\warning <em> This MADlib method is still in early stage development. There 
may be some
-issues that will be addressed in a future version. Interface and implementation
-is subject to change. </em>
-
-
 @brief Implements Flajolet-Martin's distinct count estimation
 as a user-defined aggregate.
 
@@ -79,6 +70,13 @@ Get the number of distinct values in a designated column.
 fmsketch_dcount( col_name )
 </pre>
 
+@note This is a [User Defined 
Aggregate](https://www.postgresql.org/docs/current/static/xaggr.html)
+which returns the results when used in a query. Use "CREATE TABLE AS ", with 
the
+UDA as subquery if the results are to be stored. This is unlike the usual 
MADlib
+stored procedure interface which places the results in a table instead of
+returning it.
+
+
 @anchor examples
 @examp
 -# Generate some data.
@@ -128,10 +126,6 @@ File sketch.sql_in documenting the SQL function.
 </ul>
 </div>
 
-\warning <em> This MADlib method is still in early stage development. There 
may be some
-issues that will be addressed in a future version. Interface and implementation
-is subject to change. </em>
-
 @brief Implements Cormode-Mathukrishnan <i>CountMin</i> sketches on integer
 values as a user-defined aggregate.
 
@@ -151,31 +145,27 @@ cmsketch( col_name )
   obtained from <tt>cmsketch</tt>.
 <pre class="syntax">
 cmsketch_count( cmsketch,
-                p
-              )
+                p )
 </pre>
 
 - Get the number of rows where <em>col_name</em> is between <em>m</em> and 
<em>n</em> inclusive.
 <pre class="syntax">
 cmsketch_rangecount( cmsketch,
                      m,
-                     n
-                   )
+                     n )
 </pre>
 
 - Get the <em>k</em>th percentile of <em>col_name</em> where <em>count</em> 
specifies number of rows. <em>k</em> should be an integer between 1 to 99.
 <pre class="syntax">
 cmsketch_centile( cmsketch,
                   k,
-                  count
-                )
+                  count )
 </pre>
 
 - Get the median of col_name where <em>count</em> specifies number of rows. 
This is equivalent to <tt>\ref 
cmsketch_centile(<em>cmsketch</em>,50,<em>count</em>)</tt>.
 <pre class="syntax">
 cmsketch_median( cmsketch,
-                 count
-               )
+                 count )
 </pre>
 
 - Get an n-bucket histogram for values between min and max for the column 
where each bucket has approximately the same width. The output is a text string 
containing triples {lo, hi, count} representing the buckets; counts are 
approximate.
@@ -183,17 +173,21 @@ cmsketch_median( cmsketch,
 cmsketch_width_histogram( cmsketch,
                           min,
                           max,
-                          n
-                        )
+                          n )
 </pre>
 
 - Get an n-bucket histogram for the column where each bucket has approximately 
the same count. The output is a text string containing triples {lo, hi, count} 
representing the buckets; counts are approximate.  Note that an equi-depth 
histogram is equivalent to a spanning set of equi-spaced centiles.
 <pre class="syntax">
 cmsketch_depth_histogram( cmsketch,
-                          n
-                        )
+                          n )
 </pre>
 
+@note This is a [User Defined 
Aggregate](https://www.postgresql.org/docs/current/static/xaggr.html)
+which returns the results when used in a query. Use "CREATE TABLE AS ", with 
the
+UDA as subquery if the results are to be stored. This is unlike the usual 
MADlib
+stored procedure interface which places the results in a table instead of
+returning it.
+
 @anchor examples
 @examp
 
@@ -207,14 +201,13 @@ INSERT INTO data SELECT 2,5 FROM generate_series(1,1000);
 INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
 </pre>
 
--# Count number of rows where a1 = 2 in each class.
+-# Count number of rows where a1 = 2 in each class. Store results in a table.
 <pre class="example">
+CREATE TABLE sketch_count AS
 SELECT class,
-       cmsketch_count(
-                       cmsketch( a1 ),
-                       2
-                      )
+       cmsketch_count( cmsketch( a1 ), 2 )
 FROM data GROUP BY data.class;
+SELECT * FROM sketch_count;
 </pre>
 Result:
 <pre class="result">
@@ -228,11 +221,7 @@ Result:
 -# Count number of rows where a1 is between 3 and 6.
 <pre class="example">
 SELECT class,
-       cmsketch_rangecount(
-                            cmsketch(a1),
-                            3,
-                            6
-                          )
+       cmsketch_rangecount( cmsketch(a1), 3, 6 )
 FROM data GROUP BY data.class;
 </pre>
 Result:
@@ -246,11 +235,7 @@ Result:
 
 -# Compute the 90th percentile of all of a1.
 <pre class="example">
-SELECT cmsketch_centile(
-                         cmsketch( a1 ),
-                         90,
-                         count(*)
-                       )
+SELECT cmsketch_centile( cmsketch( a1 ), 90, count(*) )
 FROM data;
 </pre>
 Result:
@@ -263,12 +248,7 @@ Result:
 
 -# Produce an equi-width histogram with 2 bins between 0 and 10.
 <pre class="example">
-SELECT cmsketch_width_histogram(
-                                 cmsketch( a1 ),
-                                 0,
-                                 10,
-                                 2
-                               )
+SELECT cmsketch_width_histogram( cmsketch( a1 ), 0, 10, 2 )
 FROM data;
 </pre>
 Result:
@@ -281,10 +261,7 @@ Result:
 
 -# Produce an equi-depth histogram of a1 with 2 bins of approximately equal 
depth.
 <pre class="example">
-SELECT cmsketch_depth_histogram(
-                                 cmsketch( a1 ),
-                                 2
-                               )
+SELECT cmsketch_depth_histogram( cmsketch( a1 ), 2 )
 FROM data;
 </pre>
 Result:
@@ -321,10 +298,6 @@ Module \ref grp_quantile for a different implementation of 
quantile function.
 </ul>
 </div>
 
-\warning <em> This MADlib method is still in early stage development. There 
may be some
-issues that will be addressed in a future version. Interface and implementation
-is subject to change. </em>
-
 @brief Implements the most frequent values variant of the CountMin sketch as a 
user-defined aggregate.
 
 MFVSketch: Most Frequent Values variant of CountMin sketch, implemented
@@ -335,23 +308,33 @@ most frequent values in the column. The output is an 
array of doubles {value, co
 in descending order of frequency; counts are approximated via CountMin 
sketches.
 Ties are handled arbitrarily.
 
-@anchor syntax
 
-<pre class="syntax">
-mfvsketch_top_histogram( col_name,
-                         n
-                       )
-</pre>
+@anchor syntax
 
 The MFV frequent-value UDA comes in two different versions:
 - a faithful implementation that preserves the approximation guarantees
 of Cormode/Muthukrishnan,
+<pre class="syntax">
+mfvsketch_top_histogram( col_name,
+                         n )
+</pre>
 - and a "quick and dirty" version that can do parallel aggregation in Greenplum
 at the expense of missing some of the most frequent values.
+<pre class="syntax">
+mfvsketch_quick_histogram( col_name,
+                           n )
+</pre>
 
 In PostgreSQL the two UDAs are identical. In Greenplum, the quick version 
should
-produce good results unless the number of values requested is very small,
-or the distribution is very flat.
+produce good results unless the number of values requested is small,
+or the distribution is flat.
+
+@note This is a [User Defined 
Aggregate](https://www.postgresql.org/docs/current/static/xaggr.html)
+which returns the results when used in a query. Use "CREATE TABLE AS ", with 
the
+UDA as subquery if the results are to be stored. This is unlike the usual 
MADlib
+stored procedure interface which places the results in a table instead of
+returning it.
+
 
 @anchor examples
 @examp
@@ -369,9 +352,7 @@ INSERT INTO data SELECT 2,6 FROM generate_series(1,1000);
 -# Produce a histogram of 5 bins and return the most frequent value and 
associated
 count in each bin.
 <pre class="example">
-SELECT mfvsketch_top_histogram( a1,
-                                5
-                              )
+SELECT mfvsketch_top_histogram( a1, 5 )
 FROM data;
 </pre>
 Result:

[2/2] incubator-madlib git commit: Sketch: Promote sketch methods to top-level

Reply via email to