[29/51] [partial] madlib-site git commit: Doc: Add v1.15.1 documentation

nkak Mon, 15 Oct 2018 11:55:23 -0700

http://git-wip-us.apache.org/repos/asf/madlib-site/blob/af0e5f14/docs/v1.15.1/group__grp__balance__sampling.html
----------------------------------------------------------------------
diff --git a/docs/v1.15.1/group__grp__balance__sampling.html 
b/docs/v1.15.1/group__grp__balance__sampling.html
new file mode 100644
index 0000000..fd50fdf
--- /dev/null
+++ b/docs/v1.15.1/group__grp__balance__sampling.html
@@ -0,0 +1,607 @@
+<!-- HTML header for doxygen 1.8.4-->
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
+<meta http-equiv="X-UA-Compatible" content="IE=9"/>
+<meta name="generator" content="Doxygen 1.8.14"/>
+<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data 
mining,deep learning,ensemble methods,data science,market basket 
analysis,affinity analysis,pca,lda,regression,elastic net,huber 
white,proportional hazards,k-means,latent dirichlet allocation,bayes,support 
vector machines,svm"/>
+<title>MADlib: Balanced Sampling</title>
+<link href="tabs.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="jquery.js"></script>
+<script type="text/javascript" src="dynsections.js"></script>
+<link href="navtree.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="resize.js"></script>
+<script type="text/javascript" src="navtreedata.js"></script>
+<script type="text/javascript" src="navtree.js"></script>
+<script type="text/javascript">
+/* @license 
magnet:?xt=urn:btih:cf05388f2679ee054f2beb29a391d25f4e673ac3&amp;dn=gpl-2.0.txt 
GPL-v2 */
+  $(document).ready(initResizable);
+/* @license-end */</script>
+<link href="search/search.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="search/searchdata.js"></script>
+<script type="text/javascript" src="search/search.js"></script>
+<script type="text/javascript">
+/* @license 
magnet:?xt=urn:btih:cf05388f2679ee054f2beb29a391d25f4e673ac3&amp;dn=gpl-2.0.txt 
GPL-v2 */
+  $(document).ready(function() { init_search(); });
+/* @license-end */
+</script>
+<script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
+    jax: ["input/TeX","output/HTML-CSS"],
+});
+</script><script type="text/javascript" async 
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js"></script>
+<!-- hack in the navigation tree -->
+<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
+<link href="doxygen.css" rel="stylesheet" type="text/css" />
+<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
+<!-- google analytics -->
+<script>
+  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new 
Date();a=s.createElement(o),
+  
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+  ga('create', 'UA-45382226-1', 'madlib.apache.org');
+  ga('send', 'pageview');
+</script>
+</head>
+<body>
+<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
+<div id="titlearea">
+<table cellspacing="0" cellpadding="0">
+ <tbody>
+ <tr style="height: 56px;">
+  <td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" 
src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/></a></td>
+  <td style="padding-left: 0.5em;">
+   <div id="projectname">
+   <span id="projectnumber">1.15.1</span>
+   </div>
+   <div id="projectbrief">User Documentation for Apache MADlib</div>
+  </td>
+   <td>        <div id="MSearchBox" class="MSearchBoxInactive">
+        <span class="left">
+          <img id="MSearchSelect" src="search/mag_sel.png"
+               onmouseover="return searchBox.OnSearchSelectShow()"
+               onmouseout="return searchBox.OnSearchSelectHide()"
+               alt=""/>
+          <input type="text" id="MSearchField" value="Search" accesskey="S"
+               onfocus="searchBox.OnSearchFieldFocus(true)" 
+               onblur="searchBox.OnSearchFieldFocus(false)" 
+               onkeyup="searchBox.OnSearchFieldChange(event)"/>
+          </span><span class="right">
+            <a id="MSearchClose" 
href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" 
border="0" src="search/close.png" alt=""/></a>
+          </span>
+        </div>
+</td>
+ </tr>
+ </tbody>
+</table>
+</div>
+<!-- end header part -->
+<!-- Generated by Doxygen 1.8.14 -->
+<script type="text/javascript">
+/* @license 
magnet:?xt=urn:btih:cf05388f2679ee054f2beb29a391d25f4e673ac3&amp;dn=gpl-2.0.txt 
GPL-v2 */
+var searchBox = new SearchBox("searchBox", "search",false,'Search');
+/* @license-end */
+</script>
+</div><!-- top -->
+<div id="side-nav" class="ui-resizable side-nav-resizable">
+  <div id="nav-tree">
+    <div id="nav-tree-contents">
+      <div id="nav-sync" class="sync"></div>
+    </div>
+  </div>
+  <div id="splitbar" style="-moz-user-select:none;" 
+       class="ui-resizable-handle">
+  </div>
+</div>
+<script type="text/javascript">
+/* @license 
magnet:?xt=urn:btih:cf05388f2679ee054f2beb29a391d25f4e673ac3&amp;dn=gpl-2.0.txt 
GPL-v2 */
+$(document).ready(function(){initNavTree('group__grp__balance__sampling.html','');});
+/* @license-end */
+</script>
+<div id="doc-content">
+<!-- window showing the filter options -->
+<div id="MSearchSelectWindow"
+     onmouseover="return searchBox.OnSearchSelectShow()"
+     onmouseout="return searchBox.OnSearchSelectHide()"
+     onkeydown="return searchBox.OnSearchSelectKey(event)">
+</div>
+
+<!-- iframe showing the search results (closed by default) -->
+<div id="MSearchResultsWindow">
+<iframe src="javascript:void(0)" frameborder="0" 
+        name="MSearchResults" id="MSearchResults">
+</iframe>
+</div>
+
+<div class="header">
+  <div class="headertitle">
+<div class="title">Balanced Sampling<div class="ingroups"><a class="el" 
href="group__grp__sampling.html">Sampling</a></div></div>  </div>
+</div><!--header-->
+<div class="contents">
+<div class="toc"><b>Contents</b> <ul>
+<li>
+<a href="#strs">Balanced Sampling</a> </li>
+<li>
+<a href="#examples">Examples</a> </li>
+<li>
+<a href="#literature">Literature</a> </li>
+<li>
+<a href="#related">Related Topics</a> </li>
+</ul>
+</div><p>Some classification algorithms only perform optimally when the number 
of samples in each class is roughly the same. Highly skewed datasets are common 
in many domains (e.g., fraud detection), so resampling to offset this imbalance 
can produce a better decision boundary.</p>
+<p>This module offers a number of resampling techniques including 
undersampling majority classes, oversampling minority classes, and combinations 
of the two.</p>
+<p><a class="anchor" id="strs"></a></p><dl class="section user"><dt>Balanced 
Sampling</dt><dd></dd></dl>
+<pre class="syntax">
+balance_sample( source_table,
+                output_table,
+                class_col,
+                class_sizes,
+                output_table_size,
+                grouping_cols,
+                with_replacement,
+                keep_null
+              )
+</pre><p><b>Arguments</b> </p><dl class="arglist">
+<dt>source_table </dt>
+<dd><p class="startdd">TEXT. Name of the table containing the input data.</p>
+<p class="enddd"></p>
+</dd>
+<dt>output_table </dt>
+<dd><p class="startdd">TEXT. Name of output table that contains the sampled 
data. The output table contains all columns present in the source table, plus a 
new generated id called "__madlib_id__" added as the first column. </p>
+<p class="enddd"></p>
+</dd>
+<dt>class_col </dt>
+<dd><p class="startdd">TEXT. Name of the column containing the class to be 
balanced. </p>
+<p class="enddd"></p>
+</dd>
+<dt>class_sizes (optional) </dt>
+<dd><p class="startdd">VARCHAR, default 'uniform'. Defines the target size 
of each class value. (Class values are sometimes also called levels.) Can be 
set to one of the following:</p>
+<ul>
+<li>
+<b>'uniform'</b>: All class values will be resampled to have the same 
number of rows.  </li>
+<li>
+<b>'undersample'</b>: Undersample such that all class values end up with the 
same number of observations as the minority class. Sampling is done without 
replacement unless the parameter 'with_replacement' is set to TRUE.  </li>
+<li>
+<b>'oversample'</b>: Oversample with replacement such that all class values 
end up with the same number of observations as the majority class. Not affected 
by the parameter 'with_replacement' since oversampling is always done with 
replacement.  </li>
+</ul>
+<p>Short forms of the above will work too, e.g., 'uni' works the 
same as 'uniform'.</p>
+<p>Alternatively, you can set class sizes explicitly with a string 
containing a comma-delimited list. Order does not matter, and not every class 
value needs to be specified. Use the format 'class_value_1=x, 
class_value_2=y, ...' where each 'class_value' in the list must exist in the 
column 'class_col' and each size is an integer giving the desired number of 
observations. E.g., 'red=3000, blue=4000' means you want to resample the 
dataset to result in exactly 3000 red and 4000 blue rows in the 
'output_table'.  </p>
+<dl class="section note"><dt>Note</dt><dd>The allowed names for class values 
follow the object naming rules in PostgreSQL [1]. Quoted identifiers are allowed 
and should be enclosed in double quotes in the usual way. If for some reason 
the class values in the examples above were "ReD" and "BluE", then the 
comma-delimited list for 'class_sizes' would be: '"ReD"=3000, 
"BluE"=4000'. </dd></dl>
+</dd>
+<dt>output_table_size (optional) </dt>
+<dd><p class="startdd">INTEGER, default NULL. Desired size of the output data 
set. This parameter is ignored if the 'class_sizes' parameter is set to either 
'oversample' or 'undersample' since the output table size is already 
determined. If NULL, the resulting output table size will depend on the 
setting of the 'class_sizes' parameter (see table below for more details). 
</p>
+<p class="enddd"></p>
+</dd>
+<dt>grouping_cols (optional) </dt>
+<dd><p class="startdd">TEXT, default: NULL. A single column or a list of 
comma-separated columns that defines the strata. When this parameter is NULL, 
no grouping is used so the sampling is non-stratified, that is, the whole table 
is treated as a single group.</p>
+<dl class="section note"><dt>Note</dt><dd>The 'output_table_size' and the 
'class_sizes' are defined for the whole table. When grouping is used, these 
parameters are split evenly for each group. Further, if a specific class value 
is specified in the 'class_sizes' parameter, that particular class value should 
be present in each group. If not, an error will be thrown. </dd></dl>
+</dd>
+<dt>with_replacement (optional) </dt>
+<dd><p class="startdd">BOOLEAN, default FALSE. Determines whether to sample 
with replacement or without replacement (default). With replacement means that 
it is possible that the same row may appear in the sample set more than once. 
Without replacement means a given row can be selected only once. This parameter 
affects undersampling only since oversampling is always done with 
replacement.</p>
+<p class="enddd"></p>
+</dd>
+<dt>keep_null (optional) </dt>
+<dd>BOOLEAN, default FALSE. Determines whether to sample rows whose class 
values are NULL. By default, all rows with NULL class values are ignored. If 
this is set to TRUE, then NULL is treated as another class value. </dd>
+</dl>
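+<p>As a minimal sketch of how the optional arguments line up positionally, 
the call below sets every argument explicitly. It assumes the 'flags' table 
created in the Examples section; the output table name 'flags_balanced' and the 
argument values are chosen only for illustration.</p>
+<pre class="syntax">
+DROP TABLE IF EXISTS flags_balanced;
+SELECT madlib.balance_sample(
+    'flags',           -- source_table
+    'flags_balanced',  -- output_table (illustrative name)
+    'mainhue',         -- class_col
+    'uniform',         -- class_sizes
+    NULL,              -- output_table_size: derived from class_sizes
+    NULL,              -- grouping_cols: no stratification
+    FALSE,             -- with_replacement
+    TRUE);             -- keep_null: treat NULL as another class value
+-- Inspect the resulting class distribution.
+SELECT mainhue, count(*) AS n_rows
+FROM flags_balanced
+GROUP BY mainhue
+ORDER BY mainhue;
+</pre>
+<p>Because sampling is random and class sizes are subject to minor rounding, 
the per-class counts returned by the second query may vary slightly between 
runs.</p>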
+<p><b>How Output Table Size is Determined</b></p>
+<p>The rule of thumb is that if you specify a value for 'output_table_size', 
then you will generally get an output table of that size, with some minor 
rounding variations. If you set 'output_table_size' to NULL, then the size of 
the output table is calculated from the value of the 
'class_sizes' parameter. The following table shows how the parameters 
'class_sizes' and 'output_table_size' work together:</p>
+<table class="markdownTable">
+<tr class="markdownTableHead">
+<th class="markdownTableHeadLeft">Case  </th><th 
class="markdownTableHeadLeft">'class_sizes'  </th><th 
class="markdownTableHeadLeft">'output_table_size'  </th><th 
class="markdownTableHeadLeft">Result  </th></tr>
+<tr class="markdownTableBody" class="markdownTableRowOdd">
+<td class="markdownTableBodyLeft">1  </td><td 
class="markdownTableBodyLeft">'uniform'  </td><td 
class="markdownTableBodyLeft">NULL  </td><td 
class="markdownTableBodyLeft">Resample for uniform class size with output size 
= input size (i.e., balanced).   </td></tr>
+<tr class="markdownTableBody" class="markdownTableRowEven">
+<td class="markdownTableBodyLeft">2  </td><td 
class="markdownTableBodyLeft">'uniform'  </td><td 
class="markdownTableBodyLeft">10000  </td><td 
class="markdownTableBodyLeft">Resample for uniform class size with output size 
= 10K (i.e., balanced).   </td></tr>
+<tr class="markdownTableBody" class="markdownTableRowOdd">
+<td class="markdownTableBodyLeft">3  </td><td 
class="markdownTableBodyLeft">NULL  </td><td class="markdownTableBodyLeft">NULL 
 </td><td class="markdownTableBodyLeft">Resample for uniform class size with 
output size = input size (i.e., balanced). 'class_sizes'=NULL has the same 
behavior as 'uniform'.   </td></tr>
+<tr class="markdownTableBody" class="markdownTableRowEven">
+<td class="markdownTableBodyLeft">4  </td><td 
class="markdownTableBodyLeft">NULL  </td><td 
class="markdownTableBodyLeft">10000  </td><td 
class="markdownTableBodyLeft">Resample for uniform class size with output size 
= 10K (i.e., balanced). 'class_sizes'=NULL has the same behavior as 'uniform'.   
</td></tr>
+<tr class="markdownTableBody" class="markdownTableRowOdd">
+<td class="markdownTableBodyLeft">5  </td><td 
class="markdownTableBodyLeft">'undersample'  </td><td 
class="markdownTableBodyLeft">n/a  </td><td 
class="markdownTableBodyLeft">Undersample such that all class values end up 
with the same number of observations as the minority.   </td></tr>
+<tr class="markdownTableBody" class="markdownTableRowEven">
+<td class="markdownTableBodyLeft">6  </td><td 
class="markdownTableBodyLeft">'oversample'  </td><td 
class="markdownTableBodyLeft">n/a  </td><td 
class="markdownTableBodyLeft">Oversample with replacement (always) such that 
all class values end up with the same number of observations as the majority.   
</td></tr>
+<tr class="markdownTableBody" class="markdownTableRowOdd">
+<td class="markdownTableBodyLeft">7  </td><td 
class="markdownTableBodyLeft">'red=3000'  </td><td 
class="markdownTableBodyLeft">NULL  </td><td 
class="markdownTableBodyLeft">Resample red to 3K, leave rest of the class 
values (blue, green, etc.) as is.   </td></tr>
+<tr class="markdownTableBody" class="markdownTableRowEven">
+<td class="markdownTableBodyLeft">8  </td><td 
class="markdownTableBodyLeft">'red=3000, blue=4000'  </td><td 
class="markdownTableBodyLeft">10000  </td><td 
class="markdownTableBodyLeft">Resample red to 3K and blue to 4K, divide 
remaining class values evenly 3K/(n-2) each, where n=number of class values. 
Note that if red and blue are the only class values, then output table size 
will be 7K not 10K. (This is the only case where specifying a value for 
'output_table_size' may not actually result in an output table of that size.)   
</td></tr>
+</table>
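+<p>The rules in the table can be checked by hand against a source table before 
sampling. The query below is a plain PostgreSQL sketch that does not call 
MADlib; it assumes the 'flags' table created in the Examples section, and the 
column aliases are illustrative. It only mirrors cases 1, 3, 5 and 6 above and 
does not reproduce whatever rounding MADlib applies internally.</p>
+<pre class="syntax">
+WITH class_counts AS (
+    SELECT mainhue, count(*) AS n_rows
+    FROM flags
+    WHERE mainhue IS NOT NULL      -- mirrors keep_null = FALSE
+    GROUP BY mainhue
+)
+SELECT count(*)               AS n_classes,
+       sum(n_rows)            AS input_rows,
+       sum(n_rows) / count(*) AS uniform_rows_per_class,  -- cases 1 and 3
+       min(n_rows) * count(*) AS undersample_total_rows,  -- case 5
+       max(n_rows) * count(*) AS oversample_total_rows    -- case 6
+FROM class_counts;
+</pre>
+<p>For the 'flags' data there are 4 non-NULL class values covering 20 rows, 
with 2 rows in the smallest class and 10 in the largest. That gives 5 rows per 
class for uniform sampling, 8 rows in total for undersampling and 40 rows in 
total for oversampling.</p>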
+<p><a class="anchor" id="examples"></a></p><dl class="section 
user"><dt>Examples</dt><dd></dd></dl>
+<p>Note that due to the random nature of sampling, your results may look 
different from those below.</p>
+<ol type="1">
+<li>Create an input table using part of the flags data set from <a 
href="https://archive.ics.uci.edu/ml/datasets/Flags">https://archive.ics.uci.edu/ml/datasets/Flags</a>: 
<pre class="syntax">
+DROP TABLE IF EXISTS flags;
+CREATE TABLE flags (
+    id INTEGER,
+    name TEXT,
+    landmass INTEGER,
+    zone INTEGER,
+    area INTEGER,
+    population INTEGER,
+    language INTEGER,
+    colours INTEGER,
+    mainhue TEXT
+);
+INSERT INTO flags VALUES
+(1, 'Argentina', 2, 3, 2777, 28, 2, 2, 'blue'),
+(2, 'Australia', 6, 2, 7690, 15, 1, 3, 'blue'),
+(3, 'Austria', 3, 1, 84, 8, 4, 2, 'red'),
+(4, 'Brazil', 2, 3, 8512, 119, 6, 4, 'green'),
+(5, 'Canada', 1, 4, 9976, 24, 1, 2, 'red'),
+(6, 'China', 5, 1, 9561, 1008, 7, 2, 'red'),
+(7, 'Denmark', 3, 1, 43, 5, 6, 2, 'red'),
+(8, 'Greece', 3, 1, 132, 10, 6, 2, 'blue'),
+(9, 'Guatemala', 1, 4, 109, 8, 2, 2, 'blue'),
+(10, 'Ireland', 3, 4, 70, 3, 1, 3, 'white'),
+(11, 'Jamaica', 1, 4, 11, 2, 1, 3, 'green'),
+(12, 'Luxembourg', 3, 1, 3, 0, 4, 3, 'red'),
+(13, 'Mexico', 1, 4, 1973, 77, 2, 4, 'green'),
+(14, 'Norway', 3, 1, 324, 4, 6, 3, 'red'),
+(15, 'Portugal', 3, 4, 92, 10, 6, 5, 'red'),
+(16, 'Spain', 3, 4, 505, 38, 2, 2, 'red'),
+(17, 'Sweden', 3, 1, 450, 8, 6, 2, 'blue'),
+(18, 'Switzerland', 3, 1, 41, 6, 4, 2, 'red'),
+(19, 'UK', 3, 4, 245, 56, 1, 3, 'red'),
+(20, 'USA', 1, 4, 9363, 231, 1, 3, 'white'),
+(21, 'xElba', 3, 1, 1, 1, 6, NULL, NULL),
+(22, 'xPrussia', 3, 1, 249, 61, 4, NULL, NULL);
+</pre></li>
+<li>Uniform sampling. All class values will be resampled so that they have the 
same number of rows. The output data size will be the same as the input data 
size, ignoring NULL values. Uniform sampling is the default for the 
'class_sizes' parameter so we do not need to explicitly set it: <pre 
class="syntax">
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+                              'flags',             -- Source table
+                              'output_table',      -- Output table
+                              'mainhue');          -- Class column
+SELECT * FROM output_table ORDER BY mainhue, name;
+</pre> <pre class="result">
+ __madlib_id__ | id |    name     | landmass | zone | area | population | language | colours | mainhue
+---------------+----+-------------+----------+------+------+------------+----------+---------+---------
+             5 |  1 | Argentina   |        2 |    3 | 2777 |         28 |        2 |       2 | blue
+             2 |  2 | Australia   |        6 |    2 | 7690 |         15 |        1 |       3 | blue
+             3 |  8 | Greece      |        3 |    1 |  132 |         10 |        6 |       2 | blue
+             4 |  9 | Guatemala   |        1 |    4 |  109 |          8 |        2 |       2 | blue
+             1 | 17 | Sweden      |        3 |    1 |  450 |          8 |        6 |       2 | blue
+            11 |  4 | Brazil      |        2 |    3 | 8512 |        119 |        6 |       4 | green
+            12 |  4 | Brazil      |        2 |    3 | 8512 |        119 |        6 |       4 | green
+            14 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
+            15 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
+            13 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
+             8 |  3 | Austria     |        3 |    1 |   84 |          8 |        4 |       2 | red
+            10 |  5 | Canada      |        1 |    4 | 9976 |         24 |        1 |       2 | red
+             9 |  7 | Denmark     |        3 |    1 |   43 |          5 |        6 |       2 | red
+             6 | 15 | Portugal    |        3 |    4 |   92 |         10 |        6 |       5 | red
+             7 | 18 | Switzerland |        3 |    1 |   41 |          6 |        4 |       2 | red
+            19 | 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
+            20 | 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
+            18 | 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
+            16 | 20 | USA         |        1 |    4 | 9363 |        231 |        1 |       3 | white
+            17 | 20 | USA         |        1 |    4 | 9363 |        231 |        1 |       3 | white
+(20 rows)
+</pre> Next we do uniform sampling again, but this time we specify a size for 
the output table: <pre class="syntax">
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+                              'flags',             -- Source table
+                              'output_table',      -- Output table
+                              'mainhue',           -- Class column
+                              'uniform',           -- Uniform sample
+                               12);                -- Desired output table size
+SELECT * FROM output_table ORDER BY mainhue, name;
+</pre> <pre class="result">
+ __madlib_id__ | id |   name    | landmass | zone | area | population | language | colours | mainhue
+---------------+----+-----------+----------+------+------+------------+----------+---------+---------
+            10 |  1 | Argentina |        2 |    3 | 2777 |         28 |        2 |       2 | blue
+            12 |  2 | Australia |        6 |    2 | 7690 |         15 |        1 |       3 | blue
+            11 |  8 | Greece    |        3 |    1 |  132 |         10 |        6 |       2 | blue
+             2 |  4 | Brazil    |        2 |    3 | 8512 |        119 |        6 |       4 | green
+             3 | 11 | Jamaica   |        1 |    4 |   11 |          2 |        1 |       3 | green
+             1 | 13 | Mexico    |        1 |    4 | 1973 |         77 |        2 |       4 | green
+             5 |  7 | Denmark   |        3 |    1 |   43 |          5 |        6 |       2 | red
+             6 | 14 | Norway    |        3 |    1 |  324 |          4 |        6 |       3 | red
+             4 | 15 | Portugal  |        3 |    4 |   92 |         10 |        6 |       5 | red
+             9 | 10 | Ireland   |        3 |    4 |   70 |          3 |        1 |       3 | white
+             7 | 20 | USA       |        1 |    4 | 9363 |        231 |        1 |       3 | white
+             8 | 20 | USA       |        1 |    4 | 9363 |        231 |        1 |       3 | white
+(12 rows)
+</pre></li>
+<li>Oversampling. Oversample with replacement such that all class values 
except NULLs end up with the same number of observations as the majority class. 
Red is the majority class with 10 observations, so the other 
class values will be oversampled to 10 observations: <pre class="syntax">
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+                              'flags',             -- Source table
+                              'output_table',      -- Output table
+                              'mainhue',           -- Class column
+                              'oversample');       -- Oversample
+SELECT * FROM output_table ORDER BY mainhue, name;
+</pre> <pre class="result">
+ __madlib_id__ | id |    name     | landmass | zone | area | population | language | colours | mainhue
+---------------+----+-------------+----------+------+------+------------+----------+---------+---------
+            35 |  1 | Argentina   |        2 |    3 | 2777 |         28 |        2 |       2 | blue
+            33 |  1 | Argentina   |        2 |    3 | 2777 |         28 |        2 |       2 | blue
+            37 |  1 | Argentina   |        2 |    3 | 2777 |         28 |        2 |       2 | blue
+            34 |  1 | Argentina   |        2 |    3 | 2777 |         28 |        2 |       2 | blue
+            36 |  1 | Argentina   |        2 |    3 | 2777 |         28 |        2 |       2 | blue
+            32 |  1 | Argentina   |        2 |    3 | 2777 |         28 |        2 |       2 | blue
+            31 |  2 | Australia   |        6 |    2 | 7690 |         15 |        1 |       3 | blue
+            39 |  9 | Guatemala   |        1 |    4 |  109 |          8 |        2 |       2 | blue
+            38 |  9 | Guatemala   |        1 |    4 |  109 |          8 |        2 |       2 | blue
+            40 | 17 | Sweden      |        3 |    1 |  450 |          8 |        6 |       2 | blue
+            19 |  4 | Brazil      |        2 |    3 | 8512 |        119 |        6 |       4 | green
+            20 |  4 | Brazil      |        2 |    3 | 8512 |        119 |        6 |       4 | green
+            12 | 11 | Jamaica     |        1 |    4 |   11 |          2 |        1 |       3 | green
+            11 | 11 | Jamaica     |        1 |    4 |   11 |          2 |        1 |       3 | green
+            13 | 11 | Jamaica     |        1 |    4 |   11 |          2 |        1 |       3 | green
+            17 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
+            15 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
+            16 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
+            18 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
+            14 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
+             9 |  3 | Austria     |        3 |    1 |   84 |          8 |        4 |       2 | red
+             8 |  5 | Canada      |        1 |    4 | 9976 |         24 |        1 |       2 | red
+             1 |  6 | China       |        5 |    1 | 9561 |       1008 |        7 |       2 | red
+            10 |  7 | Denmark     |        3 |    1 |   43 |          5 |        6 |       2 | red
+             2 | 12 | Luxembourg  |        3 |    1 |    3 |          0 |        4 |       3 | red
+             4 | 14 | Norway      |        3 |    1 |  324 |          4 |        6 |       3 | red
+             6 | 15 | Portugal    |        3 |    4 |   92 |         10 |        6 |       5 | red
+             3 | 16 | Spain       |        3 |    4 |  505 |         38 |        2 |       2 | red
+             5 | 18 | Switzerland |        3 |    1 |   41 |          6 |        4 |       2 | red
+             7 | 19 | UK          |        3 |    4 |  245 |         56 |        1 |       3 | red
+            22 | 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
+            26 | 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
+            24 | 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
+            21 | 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
+            27 | 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
+            25 | 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
+            23 | 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
+            29 | 20 | USA         |        1 |    4 | 9363 |        231 |        1 |       3 | white
+            30 | 20 | USA         |        1 |    4 | 9363 |        231 |        1 |       3 | white
+            28 | 20 | USA         |        1 |    4 | 9363 |        231 |        1 |       3 | white
+(40 rows)
+</pre></li>
+<li>Undersampling. Undersample such that all class values except NULLs end up 
with the same number of observations as the minority class. White is the 
minority class with 2 observations, so the other class values 
will be undersampled to 2 observations: <pre class="syntax">
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+                              'flags',             -- Source table
+                              'output_table',      -- Output table
+                              'mainhue',           -- Class column
+                              'undersample');      -- Undersample
+SELECT * FROM output_table ORDER BY mainhue, name;
+</pre> <pre class="result">
+ __madlib_id__ | id |    name     | landmass | zone | area | population | language | colours | mainhue
+---------------+----+-------------+----------+------+------+------------+----------+---------+---------
+             1 |  1 | Argentina   |        2 |    3 | 2777 |         28 |        2 |       2 | blue
+             2 |  2 | Australia   |        6 |    2 | 7690 |         15 |        1 |       3 | blue
+             4 |  4 | Brazil      |        2 |    3 | 8512 |        119 |        6 |       4 | green
+             3 | 13 | Mexico      |        1 |    4 | 1973 |         77 |        2 |       4 | green
+             5 | 16 | Spain       |        3 |    4 |  505 |         38 |        2 |       2 | red
+             6 | 18 | Switzerland |        3 |    1 |   41 |          6 |        4 |       2 | red
+             8 | 10 | Ireland     |        3 |    4 |   70 |          3 |        1 |       3 | white
+             7 | 20 | USA         |        1 |    4 | 9363 |        231 |        1 |       3 | white
+(8 rows)
+</pre> We may also want to undersample with replacement, so we set the 
'with_replacement' parameter to TRUE: <pre class="syntax">
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+                              'flags',             -- Source table
+                              'output_table',      -- Output table
+                              'mainhue',           -- Class column
+                              'undersample',       -- Undersample
+                               NULL,               -- Output table size will be calculated
+                               NULL,               -- No grouping
+                              'TRUE');             -- Sample with replacement
+SELECT * FROM output_table ORDER BY mainhue, name;
+</pre> <pre class="result">
+ __madlib_id__ | id |   name    | landmass | zone | area | population | language | colours | mainhue
+---------------+----+-----------+----------+------+------+------------+----------+---------+---------
+             2 |  9 | Guatemala |        1 |    4 |  109 |          8 |        2 |       2 | blue
+             1 |  9 | Guatemala |        1 |    4 |  109 |          8 |        2 |       2 | blue
+             3 |  4 | Brazil    |        2 |    3 | 8512 |        119 |        6 |       4 | green
+             4 | 13 | Mexico    |        1 |    4 | 1973 |         77 |        2 |       4 | green
+             6 |  5 | Canada    |        1 |    4 | 9976 |         24 |        1 |       2 | red
+             5 | 19 | UK        |        3 |    4 |  245 |         56 |        1 |       3 | red
+             7 | 20 | USA       |        1 |    4 | 9363 |        231 |        1 |       3 | white
+             8 | 20 | USA       |        1 |    4 | 9363 |        231 |        1 |       3 | white
+(8 rows)
+</pre> Note that some rows may appear multiple times above because we 
sampled with replacement.</li>
+<li>Setting class size by count. Here we set the number of rows for red and 
blue flags, and leave green and white flags unchanged: <pre class="syntax">
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+                              'flags',             -- Source table
+                              'output_table',      -- Output table
+                              'mainhue',           -- Class column
+                              'red=7, blue=7');    -- Want 7 reds and 7 blues
+SELECT * FROM output_table ORDER BY mainhue, name;
+</pre> <pre class="result">
+ __madlib_id__ | id |    name    | landmass | zone | area | population | language | colours | mainhue
+---------------+----+------------+----------+------+------+------------+----------+---------+---------
+             5 |  2 | Australia  |        6 |    2 | 7690 |         15 |        1 |       3 | blue
+             7 |  8 | Greece     |        3 |    1 |  132 |         10 |        6 |       2 | blue
+             6 |  8 | Greece     |        3 |    1 |  132 |         10 |        6 |       2 | blue
+             1 |  9 | Guatemala  |        1 |    4 |  109 |          8 |        2 |       2 | blue
+             3 | 17 | Sweden     |        3 |    1 |  450 |          8 |        6 |       2 | blue
+             2 | 17 | Sweden     |        3 |    1 |  450 |          8 |        6 |       2 | blue
+             4 | 17 | Sweden     |        3 |    1 |  450 |          8 |        6 |       2 | blue
+             8 |  4 | Brazil     |        2 |    3 | 8512 |        119 |        6 |       4 | green
+            18 | 11 | Jamaica    |        1 |    4 |   11 |          2 |        1 |       3 | green
+            19 | 13 | Mexico     |        1 |    4 | 1973 |         77 |        2 |       4 | green
+            13 |  3 | Austria    |        3 |    1 |   84 |          8 |        4 |       2 | red
+            14 |  5 | Canada     |        1 |    4 | 9976 |         24 |        1 |       2 | red
+            17 |  6 | China      |        5 |    1 | 9561 |       1008 |        7 |       2 | red
+            15 | 12 | Luxembourg |        3 |    1 |    3 |          0 |        4 |       3 | red
+            16 | 14 | Norway     |        3 |    1 |  324 |          4 |        6 |       3 | red
+            11 | 15 | Portugal   |        3 |    4 |   92 |         10 |        6 |       5 | red
+            12 | 16 | Spain      |        3 |    4 |  505 |         38 |        2 |       2 | red
+             9 | 10 | Ireland    |        3 |    4 |   70 |          3 |        1 |       3 | white
+            10 | 20 | USA        |        1 |    4 | 9363 |        231 |        1 |       3 | white
+(19 rows)
+</pre> Next we set the number of rows for red and blue flags, and also set an 
output table size. This means that green and white flags will be uniformly 
sampled to get to the desired output table size: <pre class="syntax">
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+                              'flags',             -- Source table
+                              'output_table',      -- Output table
+                              'mainhue',           -- Class column
+                              'red=7, blue=7',     -- Want 7 reds and 7 blues
+                               22);                -- Desired output table size
+SELECT * FROM output_table ORDER BY mainhue, name;
+</pre> <pre class="result">
+ __madlib_id__ | id |    name     | landmass | zone | area | population | 
language | colours | mainhue
+---------------+----+-------------+----------+------+------+------------+----------+---------+---------
+            16 |  1 | Argentina   |        2 |    3 | 2777 |         28 |      
  2 |       2 | blue
+            20 |  2 | Australia   |        6 |    2 | 7690 |         15 |      
  1 |       3 | blue
+            21 |  2 | Australia   |        6 |    2 | 7690 |         15 |      
  1 |       3 | blue
+            22 |  8 | Greece      |        3 |    1 |  132 |         10 |      
  6 |       2 | blue
+            18 | 17 | Sweden      |        3 |    1 |  450 |          8 |      
  6 |       2 | blue
+            19 | 17 | Sweden      |        3 |    1 |  450 |          8 |      
  6 |       2 | blue
+            17 | 17 | Sweden      |        3 |    1 |  450 |          8 |      
  6 |       2 | blue
+             9 |  4 | Brazil      |        2 |    3 | 8512 |        119 |      
  6 |       4 | green
+            10 |  4 | Brazil      |        2 |    3 | 8512 |        119 |      
  6 |       4 | green
+             8 | 11 | Jamaica     |        1 |    4 |   11 |          2 |      
  1 |       3 | green
+            11 | 13 | Mexico      |        1 |    4 | 1973 |         77 |      
  2 |       4 | green
+             6 |  3 | Austria     |        3 |    1 |   84 |          8 |      
  4 |       2 | red
+             7 |  5 | Canada      |        1 |    4 | 9976 |         24 |      
  1 |       2 | red
+             2 |  7 | Denmark     |        3 |    1 |   43 |          5 |      
  6 |       2 | red
+             1 | 12 | Luxembourg  |        3 |    1 |    3 |          0 |      
  4 |       3 | red
+             3 | 15 | Portugal    |        3 |    4 |   92 |         10 |      
  6 |       5 | red
+             5 | 16 | Spain       |        3 |    4 |  505 |         38 |      
  2 |       2 | red
+             4 | 18 | Switzerland |        3 |    1 |   41 |          6 |      
  4 |       2 | red
+            14 | 10 | Ireland     |        3 |    4 |   70 |          3 |      
  1 |       3 | white
+            13 | 10 | Ireland     |        3 |    4 |   70 |          3 |      
  1 |       3 | white
+            15 | 10 | Ireland     |        3 |    4 |   70 |          3 |      
  1 |       3 | white
+            12 | 20 | USA         |        1 |    4 | 9363 |        231 |      
  1 |       3 | white
+(22 rows)
+</pre></li>
+<li>To make NULL a valid class value, set the parameter to keep NULLs: <pre 
class="syntax">
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+                              'flags',             -- Source table
+                              'output_table',      -- Output table
+                              'mainhue',           -- Class column
+                               NULL,               -- Uniform
+                               NULL,               -- Output table size
+                               NULL,               -- No grouping
+                               NULL,               -- Sample without 
replacement
+                              'TRUE');             -- Make NULLs a valid class 
value
+SELECT * FROM output_table ORDER BY mainhue, name;
+</pre> <pre class="result">
+ __madlib_id__ | id |    name     | landmass | zone | area | population | 
language | colours | mainhue
+---------------+----+-------------+----------+------+------+------------+----------+---------+---------
+            25 |  1 | Argentina   |        2 |    3 | 2777 |         28 |      
  2 |       2 | blue
+            22 |  2 | Australia   |        6 |    2 | 7690 |         15 |      
  1 |       3 | blue
+            24 |  8 | Greece      |        3 |    1 |  132 |         10 |      
  6 |       2 | blue
+            21 |  9 | Guatemala   |        1 |    4 |  109 |          8 |      
  2 |       2 | blue
+            23 | 17 | Sweden      |        3 |    1 |  450 |          8 |      
  6 |       2 | blue
+             7 |  4 | Brazil      |        2 |    3 | 8512 |        119 |      
  6 |       4 | green
+             6 |  4 | Brazil      |        2 |    3 | 8512 |        119 |      
  6 |       4 | green
+            10 | 11 | Jamaica     |        1 |    4 |   11 |          2 |      
  1 |       3 | green
+             8 | 13 | Mexico      |        1 |    4 | 1973 |         77 |      
  2 |       4 | green
+             9 | 13 | Mexico      |        1 |    4 | 1973 |         77 |      
  2 |       4 | green
+             3 |  3 | Austria     |        3 |    1 |   84 |          8 |      
  4 |       2 | red
+             1 |  5 | Canada      |        1 |    4 | 9976 |         24 |      
  1 |       2 | red
+             2 | 16 | Spain       |        3 |    4 |  505 |         38 |      
  2 |       2 | red
+             4 | 18 | Switzerland |        3 |    1 |   41 |          6 |      
  4 |       2 | red
+             5 | 19 | UK          |        3 |    4 |  245 |         56 |      
  1 |       3 | red
+            13 | 10 | Ireland     |        3 |    4 |   70 |          3 |      
  1 |       3 | white
+            11 | 10 | Ireland     |        3 |    4 |   70 |          3 |      
  1 |       3 | white
+            14 | 10 | Ireland     |        3 |    4 |   70 |          3 |      
  1 |       3 | white
+            12 | 10 | Ireland     |        3 |    4 |   70 |          3 |      
  1 |       3 | white
+            15 | 20 | USA         |        1 |    4 | 9363 |        231 |      
  1 |       3 | white
+            17 | 21 | xElba       |        3 |    1 |    1 |          1 |      
  6 |         |
+            18 | 21 | xElba       |        3 |    1 |    1 |          1 |      
  6 |         |
+            16 | 21 | xElba       |        3 |    1 |    1 |          1 |      
  6 |         |
+            20 | 22 | xPrussia    |        3 |    1 |  249 |         61 |      
  4 |         |
+            19 | 22 | xPrussia    |        3 |    1 |  249 |         61 |      
  4 |         |
+(25 rows)
+</pre></li>
+<li>To perform the balance sampling for independent groups, use the 
'grouping_cols' parameter. Note below that each group (zone) has a different 
count of the classes (mainhue), with some groups not containing some class 
values. <pre class="syntax">
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+    'flags',          -- Source table
+    'output_table',   -- Output table
+    'mainhue',        -- Class column
+    NULL,             -- Uniform
+    NULL,             -- Output table size
+    'zone'            -- Grouping by zone
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+</pre> <pre class="result">
+ __madlib_id__ | id |    name     | landmass | zone | area | population | 
language | colours | mainhue
+---------------+----+-------------+----------+------+------+------------+----------+---------+---------
+             6 |  8 | Greece      |        3 |    1 |  132 |         10 |      
  6 |       2 | blue
+             5 |  8 | Greece      |        3 |    1 |  132 |         10 |      
  6 |       2 | blue
+             8 | 17 | Sweden      |        3 |    1 |  450 |          8 |      
  6 |       2 | blue
+             7 |  8 | Greece      |        3 |    1 |  132 |         10 |      
  6 |       2 | blue
+             2 |  7 | Denmark     |        3 |    1 |   43 |          5 |      
  6 |       2 | red
+             1 |  6 | China       |        5 |    1 | 9561 |       1008 |      
  7 |       2 | red
+             4 | 12 | Luxembourg  |        3 |    1 |    3 |          0 |      
  4 |       3 | red
+             3 | 18 | Switzerland |        3 |    1 |   41 |          6 |      
  4 |       2 | red
+             1 |  2 | Australia   |        6 |    2 | 7690 |         15 |      
  1 |       3 | blue
+             1 |  1 | Argentina   |        2 |    3 | 2777 |         28 |      
  2 |       2 | blue
+             2 |  4 | Brazil      |        2 |    3 | 8512 |        119 |      
  6 |       4 | green
+             6 |  9 | Guatemala   |        1 |    4 |  109 |          8 |      
  2 |       2 | blue
+             5 |  9 | Guatemala   |        1 |    4 |  109 |          8 |      
  2 |       2 | blue
+             4 |  9 | Guatemala   |        1 |    4 |  109 |          8 |      
  2 |       2 | blue
+            12 | 13 | Mexico      |        1 |    4 | 1973 |         77 |      
  2 |       4 | green
+            10 | 13 | Mexico      |        1 |    4 | 1973 |         77 |      
  2 |       4 | green
+            11 | 13 | Mexico      |        1 |    4 | 1973 |         77 |      
  2 |       4 | green
+             1 | 19 | UK          |        3 |    4 |  245 |         56 |      
  1 |       3 | red
+             3 |  5 | Canada      |        1 |    4 | 9976 |         24 |      
  1 |       2 | red
+             2 | 15 | Portugal    |        3 |    4 |   92 |         10 |      
  6 |       5 | red
+             8 | 20 | USA         |        1 |    4 | 9363 |        231 |      
  1 |       3 | white
+             7 | 20 | USA         |        1 |    4 | 9363 |        231 |      
  1 |       3 | white
+             9 | 10 | Ireland     |        3 |    4 |   70 |          3 |      
  1 |       3 | white
+(23 rows)
+</pre></li>
+<li>Grouping can be used with class size specification as well. Note below 
that 'blue=&lt;Integer&gt;' is the only valid class value since 'blue' is the 
only class value that is present in each group. Further, 'blue=8' will be split 
between the four groups, resulting in two blue rows for each group. <pre 
class="syntax">
+DROP TABLE IF EXISTS output_table;
+SELECT madlib.balance_sample(
+    'flags',          -- Source table
+    'output_table',   -- Output table
+    'mainhue',        -- Class column
+    'blue=8',         -- Specified class value size. Rest of the values are 
output as is.
+    NULL,             -- Output table size
+    'zone'            -- Group by zone
+);
+SELECT * FROM output_table ORDER BY zone, mainhue;
+</pre> <pre class="result">
+ __madlib_id__ | id |    name     | landmass | zone | area | population | 
language | colours | mainhue
+---------------+----+-------------+----------+------+------+------------+----------+---------+---------
+             2 | 17 | Sweden      |        3 |    1 |  450 |          8 |      
  6 |       2 | blue
+             1 |  8 | Greece      |        3 |    1 |  132 |         10 |      
  6 |       2 | blue
+             3 |  3 | Austria     |        3 |    1 |   84 |          8 |      
  4 |       2 | red
+             5 |  7 | Denmark     |        3 |    1 |   43 |          5 |      
  6 |       2 | red
+             4 |  6 | China       |        5 |    1 | 9561 |       1008 |      
  7 |       2 | red
+             8 | 18 | Switzerland |        3 |    1 |   41 |          6 |      
  4 |       2 | red
+             7 | 14 | Norway      |        3 |    1 |  324 |          4 |      
  6 |       3 | red
+             6 | 12 | Luxembourg  |        3 |    1 |    3 |          0 |      
  4 |       3 | red
+             1 |  2 | Australia   |        6 |    2 | 7690 |         15 |      
  1 |       3 | blue
+             2 |  2 | Australia   |        6 |    2 | 7690 |         15 |      
  1 |       3 | blue
+             1 |  1 | Argentina   |        2 |    3 | 2777 |         28 |      
  2 |       2 | blue
+             2 |  1 | Argentina   |        2 |    3 | 2777 |         28 |      
  2 |       2 | blue
+             3 |  4 | Brazil      |        2 |    3 | 8512 |        119 |      
  6 |       4 | green
+             2 |  9 | Guatemala   |        1 |    4 |  109 |          8 |      
  2 |       2 | blue
+             1 |  9 | Guatemala   |        1 |    4 |  109 |          8 |      
  2 |       2 | blue
+             5 | 11 | Jamaica     |        1 |    4 |   11 |          2 |      
  1 |       3 | green
+             6 | 13 | Mexico      |        1 |    4 | 1973 |         77 |      
  2 |       4 | green
+             3 |  5 | Canada      |        1 |    4 | 9976 |         24 |      
  1 |       2 | red
+             7 | 15 | Portugal    |        3 |    4 |   92 |         10 |      
  6 |       5 | red
+             8 | 16 | Spain       |        3 |    4 |  505 |         38 |      
  2 |       2 | red
+             9 | 19 | UK          |        3 |    4 |  245 |         56 |      
  1 |       3 | red
+            10 | 20 | USA         |        1 |    4 | 9363 |        231 |      
  1 |       3 | white
+             4 | 10 | Ireland     |        3 |    4 |   70 |          3 |      
  1 |       3 | white
+(23 rows)
+</pre></li>
+</ol>
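+<p>The class-size arithmetic in the last few examples can be checked with a short sketch. The Python below is illustrative only and is not part of MADlib: the helper names <code>remaining_class_size</code> and <code>per_group_size</code> are hypothetical, and rounding up on uneven splits is an assumption rather than documented behavior. It reproduces the 4 green and 4 white rows obtained with 'red=7, blue=7' and an output table size of 22, and the 2 blue rows per zone obtained when 'blue=8' is split across four groups.</p>

```python
import math


def remaining_class_size(output_size, specified, n_unspecified):
    """Rows left for each unspecified class once the specified classes are filled.

    Assumption: the remainder is shared evenly and rounded up.
    """
    remaining = output_size - sum(specified.values())
    if remaining < 0:
        raise ValueError("specified class sizes exceed the output table size")
    if n_unspecified <= 0:
        raise ValueError("there must be at least one unspecified class")
    return math.ceil(remaining / n_unspecified)


def per_group_size(class_size, n_groups):
    """Rows of one class per group when a class size is split across groups.

    Assumption: the split is even and rounded up.
    """
    if n_groups <= 0:
        raise ValueError("there must be at least one group")
    return math.ceil(class_size / n_groups)


# 'red=7, blue=7' with an output table size of 22; green and white are unspecified.
green_white = remaining_class_size(22, {"red": 7, "blue": 7}, 2)

# 'blue=8' with grouping by zone, where zone takes four distinct values.
blue_per_zone = per_group_size(8, 4)

print(green_white, blue_per_zone)
```

+<p>Both values match the row counts visible in the result tables above. Cases where the division is uneven depend on the rounding assumption and should be checked against actual output.</p>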
+<p><a class="anchor" id="literature"></a></p><dl class="section 
user"><dt>Literature</dt><dd></dd></dl>
+<p>[1] Object naming in PostgreSQL <a 
href="https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS">https://www.postgresql.org/docs/current/static/sql-syntax-lexical.html#SQL-SYNTAX-IDENTIFIERS</a></p>
+<p><a class="anchor" id="related"></a></p><dl class="section user"><dt>Related 
Topics</dt><dd></dd></dl>
+<p>File <a class="el" href="balance__sample_8sql__in.html" title="SQL 
functions for balanced data sets sampling. ">balance_sample.sql_in</a> for list 
of functions and usage. </p>
+</div><!-- contents -->
+</div><!-- doc-content -->
+<!-- start footer part -->
+<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
+  <ul>
+    <li class="footer">Generated on Mon Oct 15 2018 11:24:30 for MADlib by
+    <a href="http://www.doxygen.org/index.html">
+    <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.14 </li>
+  </ul>
+</div>
+</body>
+</html>


http://git-wip-us.apache.org/repos/asf/madlib-site/blob/af0e5f14/docs/v1.15.1/group__grp__bayes.html
----------------------------------------------------------------------
diff --git a/docs/v1.15.1/group__grp__bayes.html 
b/docs/v1.15.1/group__grp__bayes.html
new file mode 100644
index 0000000..0b11c8c
--- /dev/null
+++ b/docs/v1.15.1/group__grp__bayes.html
@@ -0,0 +1,495 @@
+<!-- HTML header for doxygen 1.8.4-->
+<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" 
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
+<html xmlns="http://www.w3.org/1999/xhtml">
+<head>
+<meta http-equiv="Content-Type" content="text/xhtml;charset=UTF-8"/>
+<meta http-equiv="X-UA-Compatible" content="IE=9"/>
+<meta name="generator" content="Doxygen 1.8.14"/>
+<meta name="keywords" content="madlib,postgres,greenplum,machine learning,data 
mining,deep learning,ensemble methods,data science,market basket 
analysis,affinity analysis,pca,lda,regression,elastic net,huber 
white,proportional hazards,k-means,latent dirichlet allocation,bayes,support 
vector machines,svm"/>
+<title>MADlib: Naive Bayes Classification</title>
+<link href="tabs.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="jquery.js"></script>
+<script type="text/javascript" src="dynsections.js"></script>
+<link href="navtree.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="resize.js"></script>
+<script type="text/javascript" src="navtreedata.js"></script>
+<script type="text/javascript" src="navtree.js"></script>
+<script type="text/javascript">
+/* @license 
magnet:?xt=urn:btih:cf05388f2679ee054f2beb29a391d25f4e673ac3&amp;dn=gpl-2.0.txt 
GPL-v2 */
+  $(document).ready(initResizable);
+/* @license-end */</script>
+<link href="search/search.css" rel="stylesheet" type="text/css"/>
+<script type="text/javascript" src="search/searchdata.js"></script>
+<script type="text/javascript" src="search/search.js"></script>
+<script type="text/javascript">
+/* @license 
magnet:?xt=urn:btih:cf05388f2679ee054f2beb29a391d25f4e673ac3&amp;dn=gpl-2.0.txt 
GPL-v2 */
+  $(document).ready(function() { init_search(); });
+/* @license-end */
+</script>
+<script type="text/x-mathjax-config">
+  MathJax.Hub.Config({
+    extensions: ["tex2jax.js", "TeX/AMSmath.js", "TeX/AMSsymbols.js"],
+    jax: ["input/TeX","output/HTML-CSS"],
+});
+</script><script type="text/javascript" async 
src="https://cdnjs.cloudflare.com/ajax/libs/mathjax/2.7.2/MathJax.js"></script>
+<!-- hack in the navigation tree -->
+<script type="text/javascript" src="eigen_navtree_hacks.js"></script>
+<link href="doxygen.css" rel="stylesheet" type="text/css" />
+<link href="madlib_extra.css" rel="stylesheet" type="text/css"/>
+<!-- google analytics -->
+<script>
+  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
+  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new 
Date();a=s.createElement(o),
+  
m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
+  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');
+  ga('create', 'UA-45382226-1', 'madlib.apache.org');
+  ga('send', 'pageview');
+</script>
+</head>
+<body>
+<div id="top"><!-- do not remove this div, it is closed by doxygen! -->
+<div id="titlearea">
+<table cellspacing="0" cellpadding="0">
+ <tbody>
+ <tr style="height: 56px;">
+  <td id="projectlogo"><a href="http://madlib.apache.org"><img alt="Logo" 
src="madlib.png" height="50" style="padding-left:0.5em;" border="0"/ ></a></td>
+  <td style="padding-left: 0.5em;">
+   <div id="projectname">
+   <span id="projectnumber">1.15.1</span>
+   </div>
+   <div id="projectbrief">User Documentation for Apache MADlib</div>
+  </td>
+   <td>        <div id="MSearchBox" class="MSearchBoxInactive">
+        <span class="left">
+          <img id="MSearchSelect" src="search/mag_sel.png"
+               onmouseover="return searchBox.OnSearchSelectShow()"
+               onmouseout="return searchBox.OnSearchSelectHide()"
+               alt=""/>
+          <input type="text" id="MSearchField" value="Search" accesskey="S"
+               onfocus="searchBox.OnSearchFieldFocus(true)" 
+               onblur="searchBox.OnSearchFieldFocus(false)" 
+               onkeyup="searchBox.OnSearchFieldChange(event)"/>
+          </span><span class="right">
+            <a id="MSearchClose" 
href="javascript:searchBox.CloseResultsWindow()"><img id="MSearchCloseImg" 
border="0" src="search/close.png" alt=""/></a>
+          </span>
+        </div>
+</td>
+ </tr>
+ </tbody>
+</table>
+</div>
+<!-- end header part -->
+<!-- Generated by Doxygen 1.8.14 -->
+<script type="text/javascript">
+/* @license 
magnet:?xt=urn:btih:cf05388f2679ee054f2beb29a391d25f4e673ac3&amp;dn=gpl-2.0.txt 
GPL-v2 */
+var searchBox = new SearchBox("searchBox", "search",false,'Search');
+/* @license-end */
+</script>
+</div><!-- top -->
+<div id="side-nav" class="ui-resizable side-nav-resizable">
+  <div id="nav-tree">
+    <div id="nav-tree-contents">
+      <div id="nav-sync" class="sync"></div>
+    </div>
+  </div>
+  <div id="splitbar" style="-moz-user-select:none;" 
+       class="ui-resizable-handle">
+  </div>
+</div>
+<script type="text/javascript">
+/* @license 
magnet:?xt=urn:btih:cf05388f2679ee054f2beb29a391d25f4e673ac3&amp;dn=gpl-2.0.txt 
GPL-v2 */
+$(document).ready(function(){initNavTree('group__grp__bayes.html','');});
+/* @license-end */
+</script>
+<div id="doc-content">
+<!-- window showing the filter options -->
+<div id="MSearchSelectWindow"
+     onmouseover="return searchBox.OnSearchSelectShow()"
+     onmouseout="return searchBox.OnSearchSelectHide()"
+     onkeydown="return searchBox.OnSearchSelectKey(event)">
+</div>
+
+<!-- iframe showing the search results (closed by default) -->
+<div id="MSearchResultsWindow">
+<iframe src="javascript:void(0)" frameborder="0" 
+        name="MSearchResults" id="MSearchResults">
+</iframe>
+</div>
+
+<div class="header">
+  <div class="headertitle">
+<div class="title">Naive Bayes Classification<div class="ingroups"><a 
class="el" href="group__grp__early__stage.html">Early Stage 
Development</a></div></div>  </div>
+</div><!--header-->
+<div class="contents">
+<div class="toc"><b>Contents</b> <ul>
+<li>
+<a href="#train">Training Function(s)</a> </li>
+<li>
+<a href="#classify">Classify Function(s)</a> </li>
+<li>
+<a href="#probabilities">Probabilities Function(s)</a> </li>
+<li>
+<a href="#adhoc">Ad Hoc Computation</a> </li>
+<li>
+<a href="#notes">Implementation Notes</a> </li>
+<li>
+<a href="#examples">Examples</a> </li>
+<li>
+<a href="#background">Technical Background</a> </li>
+<li>
+<a href="#related">Related Topics</a> </li>
+</ul>
+</div><dl class="section warning"><dt>Warning</dt><dd><em> This MADlib method 
is still in early stage development. There may be some issues that will be 
addressed in a future version. Interface and implementation are subject to 
change. </em></dd></dl>
+<p>Naive Bayes refers to a stochastic model where all independent variables \( 
a_1, \dots, a_n \) (often referred to as attributes in this context) 
independently contribute to the probability that a data point belongs to a 
certain class \( c \).</p>
+<p>Naive Bayes classification estimates feature probabilities and class 
priors using maximum likelihood or Laplacian smoothing. For numeric attributes, 
Gaussian smoothing can be used to estimate the feature probabilities. These 
parameters are then used to classify new data.</p>
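+<p>The estimation and classification steps can be sketched for categorical attributes in a few lines of Python. This is a minimal illustration and not MADlib's implementation: the names <code>train</code>, <code>scores</code>, and <code>classify</code> are hypothetical, and the smoothed estimate (count + 1) / (class count + number of distinct values of the attribute) is an assumed form of Laplacian smoothing. The six training rows are a small toy data set, and ties between classes are ignored.</p>

```python
from collections import Counter
from fractions import Fraction

# Toy training data: (class, attributes) pairs.
TRAINING = [
    (1, [1, 2, 3]), (1, [1, 2, 1]), (1, [1, 4, 3]),
    (2, [1, 2, 2]), (2, [0, 2, 2]), (2, [0, 1, 3]),
]


def train(rows):
    """Count classes, (class, attribute index, value) triples, and distinct values."""
    class_counts = Counter(label for label, _ in rows)
    feature_counts = Counter()
    distinct_values = {}
    for label, attrs in rows:
        for index, value in enumerate(attrs):
            feature_counts[(label, index, value)] += 1
            distinct_values.setdefault(index, set()).add(value)
    return class_counts, feature_counts, distinct_values


def scores(model, attrs, smoothing=1):
    """Unnormalized P(class) * prod P(attr | class), with assumed Laplace smoothing."""
    class_counts, feature_counts, distinct_values = model
    total = sum(class_counts.values())
    result = {}
    for label, class_count in class_counts.items():
        score = Fraction(class_count, total)  # class prior
        for index, value in enumerate(attrs):
            numerator = feature_counts[(label, index, value)] + smoothing
            denominator = class_count + smoothing * len(distinct_values[index])
            score *= Fraction(numerator, denominator)
        result[label] = score
    return result


def classify(model, attrs):
    """Return the class with the highest score (first one wins on a tie)."""
    class_scores = scores(model, attrs)
    return max(class_scores, key=class_scores.get)


model = train(TRAINING)
predictions = [classify(model, attrs) for attrs in ([0, 2, 1], [1, 2, 3])]
print(predictions)
```

+<p>Exact fractions are used so that the scores can be compared with a hand calculation. For the attributes {1,2,3}, class 1 scores 1/2 * 4/5 * 1/2 * 1/2 = 1/10, which is higher than the score for class 2.</p>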
+<p><a class="anchor" id="train"></a></p><dl class="section user"><dt>Training 
Function(s)</dt><dd></dd></dl>
+<p>For data with only categorical attributes, precompute feature probabilities 
and class priors using the following function:</p>
+<pre class="syntax">
+create_nb_prepared_data_tables ( trainingSource,
+                                 trainingClassColumn,
+                                 trainingAttrColumn,
+                                 numAttrs,
+                                 featureProbsName,
+                                 classPriorsName
+                               )
+</pre><p>For data containing both categorical and numeric attributes, use the 
following form to precompute the Gaussian parameters (mean and variance) for 
numeric attributes alongside the feature probabilities for categorical 
attributes and class priors.</p>
+<pre class="syntax">
+create_nb_prepared_data_tables ( trainingSource,
+                                 trainingClassColumn,
+                                 trainingAttrColumn,
+                                 numericAttrsColumnIndices,
+                                 numAttrs,
+                                 featureProbsName,
+                                 numericAttrParamsName,
+                                 classPriorsName
+                               )
+</pre><p>The <em>trainingSource</em> is expected to be of the following form: 
</p><pre>{TABLE|VIEW} <em>trainingSource</em> (
+    ...
+    <em>trainingClassColumn</em> INTEGER,
+    <em>trainingAttrColumn</em> INTEGER[] OR NUMERIC[] OR FLOAT8[],
+    ...
+)</pre><p><em>numericAttrsColumnIndices</em> should be of type TEXT, specified 
as an array of indices (starting from 1) in the <em>trainingAttrColumn</em> 
attributes-array that correspond to numeric attributes.</p>
+<p>The two output tables are:</p><ul>
+<li><em>featureProbsName</em> &ndash; stores feature probabilities</li>
+<li><em>classPriorsName</em> &ndash; stores the class priors</li>
+</ul>
+<p>In addition to the above, if the function specifying numeric attributes is 
used, an additional table <em>numericAttrParamsName</em> is created which 
stores the Gaussian parameters for the numeric attributes.</p>
+<p><a class="anchor" id="classify"></a></p><dl class="section 
user"><dt>Classify Function(s)</dt><dd></dd></dl>
+<p>Perform Naive Bayes classification: </p><pre class="syntax">
+create_nb_classify_view ( featureProbsName,
+                          classPriorsName,
+                          classifySource,
+                          classifyKeyColumn,
+                          classifyAttrColumn,
+                          numAttrs,
+                          destName
+                        )
+</pre><p>For data with numeric attributes, use the following version:</p>
+<pre class="syntax">
+create_nb_classify_view ( featureProbsName,
+                          classPriorsName,
+                          classifySource,
+                          classifyKeyColumn,
+                          classifyAttrColumn,
+                          numAttrs,
+                          numericAttrParamsName,
+                          destName
+                        )
+</pre><p>The <b>data to classify</b> is expected to be of the following form: 
</p><pre>{TABLE|VIEW} <em>classifySource</em> (
+    ...
+    <em>classifyKeyColumn</em> ANYTYPE,
+    <em>classifyAttrColumn</em> INTEGER[],
+    ...
+)</pre><p>This function creates the view <code><em>destName</em></code> 
mapping <em>classifyKeyColumn</em> to the Naive Bayes classification. </p><pre 
class="result">
+key | nb_classification
+&#160;---+------------------
+...
+</pre><p><a class="anchor" id="probabilities"></a></p><dl class="section 
user"><dt>Probabilities Function(s)</dt><dd></dd></dl>
+<p>Compute Naive Bayes probabilities. </p><pre class="syntax">
+create_nb_probs_view( featureProbsName,
+                      classPriorsName,
+                      classifySource,
+                      classifyKeyColumn,
+                      classifyAttrColumn,
+                      numAttrs,
+                      destName
+                    )
+</pre><p>For data with numeric attributes, use the following version:</p>
+<pre class="syntax">
+create_nb_probs_view( featureProbsName,
+                      classPriorsName,
+                      classifySource,
+                      classifyKeyColumn,
+                      classifyAttrColumn,
+                      numAttrs,
+                      numericAttrParamsName,
+                      destName
+                    )
+</pre><p>This creates the view <code><em>destName</em></code> mapping 
<em>classifyKeyColumn</em> and every single class to the Naive Bayes 
probability: </p><pre class="result">
+key | class | nb_prob
+&#160;---+-------+--------
+...
+</pre><p><a class="anchor" id="adhoc"></a></p><dl class="section user"><dt>Ad 
Hoc Computation Function</dt><dd></dd></dl>
+<p>The functions <a class="el" 
href="bayes_8sql__in.html#a798402280fc6db710957ae3ab58767e0" title="Create a 
view with columns (key, nb_classification) ">create_nb_classify_view()</a> and 
<a class="el" href="bayes_8sql__in.html#a163afffd0c845d325f060f74bcf02243" 
title="Create view with columns (key, class, nb_prob) 
">create_nb_probs_view()</a> can also be used in an ad hoc fashion, without the 
precomputation step. In this case, replace the function arguments</p>
+<pre>'<em>featureProbsName</em>', '<em>classPriorsName</em>'</pre><p> with 
</p><pre>'<em>trainingSource</em>', '<em>trainingClassColumn</em>', 
'<em>trainingAttrColumn</em>'</pre><p> for data without any numeric 
attributes and with </p><pre>'<em>trainingSource</em>', 
'<em>trainingClassColumn</em>', '<em>trainingAttrColumn</em>', 
'<em>numericAttrsColumnIndices</em>'</pre><p> for data containing numeric 
attributes as well.</p>
+<p><a class="anchor" id="notes"></a></p><dl class="section 
user"><dt>Implementation Notes</dt><dd><ul>
+<li>The probabilities computed on PostgreSQL and on Greenplum Database can 
differ slightly due to the nature of floating point computation. Usually this 
is not important. However, if a data point has <p 
class="formulaDsp">
+\[ P(C=c_i \mid A) \approx P(C=c_j \mid A) \]
+</p>
+ for two classes, this data point might be classified into different classes on 
PostgreSQL and Greenplum. This leads to differences in classifications on 
PostgreSQL and Greenplum for some data sets, but this should not affect the 
quality of the results.</li>
+<li>When two classes have equal and highest probability among all classes, the 
classification result is an array of these two classes, but the order of the 
two classes is random.</li>
+<li>The current implementation of Naive Bayes classification is suitable for 
discrete (categorical) attributes as well as continuous (numeric) 
attributes.<br />
+For continuous data, a typical assumption, usually used for small datasets, is 
that the continuous values associated with each class are distributed according 
to a Gaussian distribution, and the probabilities \( P(A_i = a \mid C=c) \) are 
estimated using the Gaussian Distribution formula: <p class="formulaDsp">
+\[ P(A_i=a \mid C=c) = 
\frac{1}{\sqrt{2\pi\sigma^{2}_c}}\exp\left(-\frac{(a-\mu_c)^{2}}{2\sigma^{2}_c}\right)
 \]
+</p>
+ where \(\mu_c\) and \(\sigma^{2}_c\) are the population mean and variance of 
the attribute for the class \(c\).<br />
+Another common technique for handling continuous values, which is better for 
large data sets, is to use binning to discretize the values, and convert the 
continuous data into categorical bins. This approach is currently not 
implemented.</li>
+<li>One can provide floating point data to the Naive Bayes classification 
function. If the corresponding attribute index is not specified in 
<em>numericAttrsColumnIndices</em>, floating point numbers will be used as 
symbolic substitutions for categorical data. In this case, the classification 
would work best if there are sufficient data points for each floating point 
attribute. However, if floating point numbers are used as continuous data 
without the attribute being marked as of type numeric in 
<em>numericAttrsColumnIndices</em>, no warning is raised and the result may not 
be as expected.</li>
+</ul>
+</dd></dl>
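+<p>The Gaussian estimate described in the notes above can be evaluated directly. The following Python sketch is illustrative only: <code>gaussian_likelihood</code> is a hypothetical helper and not a MADlib function. It implements the density formula for a single attribute value, given the mean and variance of that attribute within one class.</p>

```python
import math


def gaussian_likelihood(a, mean, variance):
    """Gaussian density used as P(A_i = a | C = c) for a numeric attribute.

    mean and variance are the per-class parameters of the attribute.
    """
    if variance <= 0:
        raise ValueError("variance must be positive")
    exponent = -((a - mean) ** 2) / (2.0 * variance)
    return math.exp(exponent) / math.sqrt(2.0 * math.pi * variance)


# At the mean of a unit-variance attribute the density is 1 / sqrt(2 * pi).
density_at_mean = gaussian_likelihood(0.0, 0.0, 1.0)

# The density is symmetric around the mean.
left = gaussian_likelihood(-1.5, 0.0, 4.0)
right = gaussian_likelihood(1.5, 0.0, 4.0)

print(density_at_mean, left, right)
```

+<p>Because this is a density and not a probability mass, values above 1 are possible when the variance is small. Only the comparison between classes matters for classification.</p>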
+<p><a class="anchor" id="examples"></a></p><dl class="section 
user"><dt>Examples</dt><dd></dd></dl>
+<p>The following is an extremely simplified example of the above option #1 
which can be verified by hand.</p>
+<ol type="1">
+<li>The training and the classification data. <pre class="example">
+SELECT * FROM training;
+</pre> Result: <pre class="result">
+ id | class | attributes
+&#160;---+-------+------------
+  1 |     1 | {1,2,3}
+  2 |     1 | {1,2,1}
+  3 |     1 | {1,4,3}
+  4 |     2 | {1,2,2}
+  5 |     2 | {0,2,2}
+  6 |     2 | {0,1,3}
+(6 rows)
+</pre> <pre class="example">
+SELECT * FROM toclassify;
+</pre> Result: <pre class="result">
+ id | attributes
+&#160;---+------------
+  1 | {0,2,1}
+  2 | {1,2,3}
+(2 rows)
+</pre></li>
+<li>Precompute feature probabilities and class priors. <pre class="example">
+SELECT madlib.create_nb_prepared_data_tables( 'training',
+                                              'class',
+                                              'attributes',
+                                              3,
+                                              'nb_feature_probs',
+                                              'nb_class_priors'
+                                            );
+</pre></li>
+<li>Optionally check the contents of the precomputed tables. <pre 
class="example">
+SELECT * FROM nb_class_priors;
+</pre> Result: <pre class="result">
+ class | class_cnt | all_cnt
+&#160;------+-----------+---------
+     1 |         3 |       6
+     2 |         3 |       6
+(2 rows)
+</pre> <pre class="example">
+SELECT * FROM nb_feature_probs;
+</pre> Result: <pre class="result">
+ class | attr | value | cnt | attr_cnt
+&#160;------+------+-------+-----+----------
+     1 |    1 |     0 |   0 |        2
+     1 |    1 |     1 |   3 |        2
+     1 |    2 |     1 |   0 |        3
+     1 |    2 |     2 |   2 |        3
+...
+</pre></li>
+<li>Create the view with Naive Bayes classification and check the results. 
<pre class="example">
+SELECT madlib.create_nb_classify_view( 'nb_feature_probs',
+                                       'nb_class_priors',
+                                       'toclassify',
+                                       'id',
+                                       'attributes',
+                                       3,
+                                       'nb_classify_view_fast'
+                                     );
+&#160;
+SELECT * FROM nb_classify_view_fast;
+</pre> Result: <pre class="result">
+ key | nb_classification
+&#160;----+-------------------
+   1 | {2}
+   2 | {1}
+(2 rows)
+</pre></li>
+<li>Look at the probabilities for each class (note that we use "Laplacian 
smoothing"). <pre class="example">
+SELECT madlib.create_nb_probs_view( 'nb_feature_probs',
+                                    'nb_class_priors',
+                                    'toclassify',
+                                    'id',
+                                    'attributes',
+                                    3,
+                                    'nb_probs_view_fast'
+                                  );
+&#160;
+SELECT * FROM nb_probs_view_fast;
+</pre> Result: <pre class="result">
+ key | class | nb_prob
+&#160;----+-------+---------
+   1 |     1 |     0.4
+   1 |     2 |     0.6
+   2 |     1 |    0.75
+   2 |     2 |    0.25
+(4 rows)
+</pre></li>
+</ol>
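+<p>The hand verification mentioned above can be sketched in a few lines of Python. This is not MADlib code: the helper <em>nb_probs</em> is a hypothetical name, and the sketch assumes a smoothing factor of 1 and that the reported probabilities are the per-class scores normalized to sum to one. Under those assumptions it reproduces the values shown in <em>nb_probs_view_fast</em>.</p>

```python
from fractions import Fraction

# Rows copied from the example tables above: (class, attributes).
training = [
    (1, [1, 2, 3]), (1, [1, 2, 1]), (1, [1, 4, 3]),
    (2, [1, 2, 2]), (2, [0, 2, 2]), (2, [0, 1, 3]),
]
toclassify = {1: [0, 2, 1], 2: [1, 2, 3]}


def nb_probs(training, attributes, s=1):
    """Return {class: normalized probability} for one row (hypothetical helper)."""
    total = len(training)
    scores = {}
    for c in sorted({cls for cls, _ in training}):
        rows = [attrs for cls, attrs in training if cls == c]
        score = Fraction(len(rows), total)  # class prior
        for i, a in enumerate(attributes):
            # Distinct values of attribute i over all classes (attr_cnt).
            distinct = len({attrs[i] for _, attrs in training})
            # Training rows of class c where attribute i equals a (cnt).
            count = sum(1 for attrs in rows if attrs[i] == a)
            score *= Fraction(count + s, len(rows) + s * distinct)
        scores[c] = score
    norm = sum(scores.values())
    return {c: score / norm for c, score in scores.items()}


results = {key: nb_probs(training, attrs) for key, attrs in toclassify.items()}
for key, probs in results.items():
    for c, p in probs.items():
        print(key, c, float(p))
```

+<p>Exact fractions are used so that the comparison with the table does not depend on floating point rounding.</p>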
+<p>The following is an example of using a dataset with both numeric and 
categorical attributes.</p>
+<ol type="1">
+<li>The training and the classification data. Attributes 
{height(numeric),weight(numeric),shoe size(categorical)}, 
Class{sex(1=male,2=female)} <pre class="example">
+SELECT * FROM gaussian_data;
+</pre> Result: <pre class="result">
+ id | sex |  attributes
+&#160;---+-----+---------------
+  1 |   1 | {6,180,12}
+  2 |   1 | {5.92,190,12}
+  3 |   1 | {5.58,170,11}
+  4 |   1 | {5.92,165,11}
+  5 |   2 | {5,100,6}
+  6 |   2 | {5.5,150,6}
+  7 |   2 | {5.42,130,7}
+  8 |   2 | {5.75,150,8}
+(8 rows)
+</pre> <pre class="example">
+SELECT * FROM gaussian_test;
+</pre> Result: <pre class="result">
+ id | sex |  attributes
+----+-----+--------------
+  9 |   1 | {5.8,180,11}
+ 10 |   2 | {5,160,6}
+(2 rows)
+</pre></li>
+<li>Precompute feature probabilities and class priors. <pre class="example">
+SELECT madlib.create_nb_prepared_data_tables( 'gaussian_data',
+                                              'sex',
+                                              'attributes',
+                                              'ARRAY[1,2]',
+                                              3,
+                                              'categ_feature_probs',
+                                              'numeric_attr_params',
+                                              'class_priors'
+                                            );
+</pre></li>
+<li>Optionally check the contents of the precomputed tables. <pre 
class="example">
+SELECT * FROM class_priors;
+</pre> Result: <pre class="result">
+ class | class_cnt | all_cnt
+&#160;------+-----------+---------
+     1 |         4 |       8
+     2 |         4 |       8
+(2 rows)
+</pre> <pre class="example">
+SELECT * FROM categ_feature_probs;
+</pre> Result: <pre class="result">
+ class | attr | value | cnt | attr_cnt
+-------+------+-------+-----+----------
+     2 |    3 |     6 |   2 |        5
+     1 |    3 |    12 |   2 |        5
+     2 |    3 |     7 |   1 |        5
+     1 |    3 |    11 |   2 |        5
+     2 |    3 |     8 |   1 |        5
+     2 |    3 |    12 |   0 |        5
+     1 |    3 |     6 |   0 |        5
+     2 |    3 |    11 |   0 |        5
+     1 |    3 |     8 |   0 |        5
+     1 |    3 |     7 |   0 |        5
+(10 rows)
+</pre> <pre class="example">
+SELECT * FROM numeric_attr_params;
+</pre> Result: <pre class="result">
+ class | attr |      attr_mean       |        attr_var
+-------+------+----------------------+------------------------
+     1 |    1 |   5.8550000000000000 | 0.03503333333333333333
+     1 |    2 | 176.2500000000000000 |   122.9166666666666667
+     2 |    1 |   5.4175000000000000 | 0.09722500000000000000
+     2 |    2 | 132.5000000000000000 |   558.3333333333333333
+(4 rows)
+</pre></li>
+<li>Create the view with Naive Bayes classification and check the results. 
<pre class="example">
+SELECT madlib.create_nb_classify_view( 'categ_feature_probs',
+                                       'class_priors',
+                                       'gaussian_test',
+                                       'id',
+                                       'attributes',
+                                       3,
+                                       'numeric_attr_params',
+                                       'classify_view'
+                                     );
+&#160;
+SELECT * FROM classify_view;
+</pre> Result: <pre class="result">
+ key | nb_classification
+&#160;----+-------------------
+   9 | {1}
+  10 | {2}
+(2 rows)
+</pre></li>
+<li>Look at the probabilities for each class. <pre class="example">
+SELECT madlib.create_nb_probs_view( 'categ_feature_probs',
+                                    'class_priors',
+                                    'gaussian_test',
+                                    'id',
+                                    'attributes',
+                                    3,
+                                    'numeric_attr_params',
+                                    'probs_view'
+                                  );
+&#160;
+SELECT * FROM probs_view;
+</pre> Result: <pre class="result">
+ key | class |       nb_prob
+-----+-------+----------------------
+   9 |     1 |    0.993556745948775
+   9 |     2 |  0.00644325405122553
+  10 |     1 | 5.74057538627122e-05
+  10 |     2 |    0.999942594246137
+(4 rows)
+</pre></li>
+</ol>
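+<p>The mixed numeric and categorical example can be cross-checked outside the database with the following Python sketch. It is not MADlib code, and the names <em>gaussian_pdf</em>, <em>nb_probs</em> and <em>NUMERIC</em> are hypothetical. It assumes that numeric attributes use the Gaussian formula with the per-class mean and the sample variance (the n - 1 denominator, which is what the <em>attr_var</em> column above corresponds to), that the categorical attribute uses a smoothing factor of 1, and that the per-class scores are normalized to sum to one.</p>

```python
import math
from statistics import mean, variance

# Rows copied from gaussian_data: (sex, [height, weight, shoe size]).
gaussian_data = [
    (1, [6, 180, 12]), (1, [5.92, 190, 12]),
    (1, [5.58, 170, 11]), (1, [5.92, 165, 11]),
    (2, [5, 100, 6]), (2, [5.5, 150, 6]),
    (2, [5.42, 130, 7]), (2, [5.75, 150, 8]),
]
gaussian_test = {9: [5.8, 180, 11], 10: [5, 160, 6]}
NUMERIC = {0, 1}  # zero-based positions of the numeric attributes


def gaussian_pdf(x, mu, var):
    """Gaussian density with mean mu and variance var, evaluated at x."""
    return math.exp(-((x - mu) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def nb_probs(data, attributes, s=1):
    """Return {class: normalized probability} for one row (hypothetical helper)."""
    total = len(data)
    scores = {}
    for c in sorted({cls for cls, _ in data}):
        rows = [attrs for cls, attrs in data if cls == c]
        score = len(rows) / total  # class prior
        for i, a in enumerate(attributes):
            column = [attrs[i] for attrs in rows]
            if i in NUMERIC:
                # Assumption: sample variance, as in numeric_attr_params.
                score *= gaussian_pdf(a, mean(column), variance(column))
            else:
                # Smoothed relative frequency for the categorical attribute.
                distinct = len({attrs[i] for _, attrs in data})
                score *= (column.count(a) + s) / (len(rows) + s * distinct)
        scores[c] = score
    norm = sum(scores.values())
    return {c: score / norm for c, score in scores.items()}


probs = {key: nb_probs(gaussian_data, attrs) for key, attrs in gaussian_test.items()}
for key, by_class in probs.items():
    for c, p in by_class.items():
        print(key, c, p)
```

+<p>Note that a Gaussian density can exceed 1 for attributes with a small variance, such as height here; only the normalized result is a probability.</p>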
+<p><a class="anchor" id="background"></a></p><dl class="section 
user"><dt>Technical Background</dt><dd></dd></dl>
+<p>In detail, <b>Bayes'</b> theorem states that </p><p class="formulaDsp">
+\[ \Pr(C = c \mid A_1 = a_1, \dots, A_n = a_n) = \frac{\Pr(C = c) \cdot 
\Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c)} {\Pr(A_1 = a_1, \dots, A_n = a_n)} 
\,, \]
+</p>
+<p> and the <b>naive</b> assumption is that </p><p class="formulaDsp">
+\[ \Pr(A_1 = a_1, \dots, A_n = a_n \mid C = c) = \prod_{i=1}^n \Pr(A_i = a_i 
\mid C = c) \,. \]
+</p>
+<p> Naive Bayes classification estimates feature probabilities and class 
priors using maximum likelihood or Laplacian smoothing. These parameters are 
then used to classify new data.</p>
+<p>A Naive Bayes classifier computes the following formula: </p><p 
class="formulaDsp">
+\[ \text{classify}(a_1, \dots, a_n) = \arg\max_c \left\{ \Pr(C = c) \cdot 
\prod_{i=1}^n \Pr(A_i = a_i \mid C = c) \right\} \]
+</p>
+<p> where \( c \) ranges over all classes in the training data and 
probabilities are estimated with relative frequencies from the training set. 
There are different ways to estimate the feature probabilities \( P(A_i = a 
\mid C = c) \). The maximum likelihood estimate takes the relative frequencies. 
That is: </p><p class="formulaDsp">
+\[ P(A_i = a \mid C = c) = \frac{\#(c,i,a)}{\#c} \]
+</p>
+<p> where</p><ul>
+<li>\( \#(c,i,a) \) denotes the number of training samples where attribute \( i \) 
is \( a \) and class is \( c \)</li>
+<li>\( \#c \) denotes the number of training samples where class is \( c \).</li>
+</ul>
+<p>Since the maximum likelihood sometimes results in estimates of "0", you 
might want to use a "smoothed" estimate. To do this, you add a number of 
"virtual" samples and make the assumption that these samples are evenly 
distributed among the values assumed by attribute \( i \) (that is, the set of 
all values observed for attribute \( i \) for any class):</p>
+<p class="formulaDsp">
+\[ P(A_i = a \mid C = c) = \frac{\#(c,i,a) + s}{\#c + s \cdot \#i} \]
+</p>
+<p> where</p><ul>
+<li>\( \#i \) denotes the number of distinct values for attribute \( i \) (for all 
classes)</li>
+<li>\( s \geq 0 \) denotes the smoothing factor.</li>
+</ul>
+<p>The case \( s = 1 \) is known as "Laplace smoothing". The case \( s = 0 \) 
trivially reduces to maximum-likelihood estimates.</p>
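+<p>As a worked instance of these formulas, assume the counts of the <em>training</em> table from the Examples section: class 1 has \( \#c = 3 \) samples, attribute 1 takes \( \#i = 2 \) distinct values, and the value 0 never occurs together with class 1, so \( \#(c,i,a) = 0 \). The two estimates would then be:</p>
+<p class="formulaDsp">
+\[ \frac{\#(c,i,a)}{\#c} = \frac{0}{3} = 0 \qquad \text{versus} \qquad \frac{\#(c,i,a) + s}{\#c + s \cdot \#i} = \frac{0 + 1}{3 + 1 \cdot 2} = \frac{1}{5} \quad (s = 1) \,. \]
+</p>
+<p>The maximum likelihood estimate of zero would force the entire product for class 1 to zero, whereas the smoothed estimate keeps it positive.</p>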
+<p><a class="anchor" id="literature"></a></p><dl class="section 
user"><dt>Literature</dt><dd></dd></dl>
+<p>[1] Tom Mitchell: Machine Learning, McGraw Hill, 1997. Book chapter 
<em>Generative and Discriminative Classifiers: Naive Bayes and Logistic 
Regression</em> available at: <a 
href="http://www.cs.cmu.edu/~tom/NewChapters.html">http://www.cs.cmu.edu/~tom/NewChapters.html</a></p>
+<p>[2] Wikipedia, Naive Bayes classifier, <a 
href="http://en.wikipedia.org/wiki/Naive_Bayes_classifier">http://en.wikipedia.org/wiki/Naive_Bayes_classifier</a></p>
+<p><a class="anchor" id="related"></a></p><dl class="section user"><dt>Related 
Topics</dt><dd>File <a class="el" href="bayes_8sql__in.html" title="SQL 
functions for naive Bayes. ">bayes.sql_in</a> documenting the SQL 
functions.</dd></dl>
+</div><!-- contents -->
+</div><!-- doc-content -->
+<!-- start footer part -->
+<div id="nav-path" class="navpath"><!-- id is needed for treeview function! -->
+  <ul>
+    <li class="footer">Generated on Mon Oct 15 2018 11:24:30 for MADlib by
+    <a href="http://www.doxygen.org/index.html">
+    <img class="footer" src="doxygen.png" alt="doxygen"/></a> 1.8.14 </li>
+  </ul>
+</div>
+</body>
+</html>
