[
https://issues.apache.org/jira/browse/FLINK-2030?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14704863#comment-14704863
]
ASF GitHub Bot commented on FLINK-2030:
---------------------------------------
Github user tillrohrmann commented on a diff in the pull request:
https://github.com/apache/flink/pull/861#discussion_r37527518
--- Diff: docs/libs/ml/statistics.md ---
@@ -0,0 +1,69 @@
+---
+mathjax: include
+htmlTitle: FlinkML - Statistics
+title: <a href="../ml">FlinkML</a> - Statistics
+---
+<!--
+Licensed to the Apache Software Foundation (ASF) under one
+or more contributor license agreements. See the NOTICE file
+distributed with this work for additional information
+regarding copyright ownership. The ASF licenses this file
+to you under the Apache License, Version 2.0 (the
+"License"); you may not use this file except in compliance
+with the License. You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing,
+software distributed under the License is distributed on an
+"AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+KIND, either express or implied. See the License for the
+specific language governing permissions and limitations
+under the License.
+-->
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The statistics utility provides features such as building histograms over data.
+
+## Methods
+
+ The Statistics utility provides two major functions: `createHistogram` and
+ `createDiscreteHistogram`.
+
+### Creating a histogram
+
+ There are two types of histograms:
+ 1. **Continuous Histograms**: These histograms are formed on a data set `X: DataSet[Double]`
+ when the values in `X` are from a continuous range. These histograms support
+ `quantile` and `sum` operations. Here `quantile(q)` refers to a value $x_q$ such that
+ $|\{x \in X : x \leq x_q\}| = q \cdot |X|$. Further, `sum(s)` refers to the number of elements
+ $x \leq s$, which can be construed as the cumulative probability at $s$ scaled by $|X|$.
+ A continuous histogram can be formed by calling `X.createHistogram(b)` where `b` is the
+ number of bins.
+ 2. **Discrete Histograms**: These histograms are formed on a data set `X: DataSet[Double]`
+ when the values in `X` are from a discrete distribution. These histograms
+ support the `count(c)` operation, which returns the number of elements associated with category `c`.
+ <br>
--- End diff --
html tags should be replaced by markdown syntax
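
For reference, the `quantile`, `sum` and `count` semantics described in the diff can be sketched in plain Scala over an in-memory `Seq[Double]`. This is only an illustration of the definitions, not the FlinkML implementation: the object name `HistogramSemantics` and the helper signatures are hypothetical, and the exact (non-binned) computation is an assumption made for clarity. A real histogram with `b` bins would only approximate these values.

```scala
// Illustrative sketch only: exact, non-binned versions of the operations
// described in the docs. Names and signatures here are hypothetical.
object HistogramSemantics {

  // sum(s): number of elements x in the data with x <= s.
  def sum(data: Seq[Double], s: Double): Int =
    data.count(_ <= s)

  // quantile(q): smallest data value x_q such that at least q * |X|
  // elements are <= x_q (assumes q in (0, 1] and non-empty data).
  def quantile(data: Seq[Double], q: Double): Double = {
    require(data.nonEmpty, "data must be non-empty")
    require(q > 0.0 && q <= 1.0, "q must lie in (0, 1]")
    val sorted = data.sorted
    val rank = math.ceil(q * sorted.size).toInt // 1-based rank
    sorted(math.max(rank, 1) - 1)
  }

  // count(c): number of elements equal to category c (discrete case).
  def count(data: Seq[Double], c: Double): Int =
    data.count(_ == c)
}
```

With `Seq(1.0, 2.0, 2.0, 3.0, 5.0)`, `sum(data, 2.0)` counts the three elements not exceeding 2.0, and `quantile(data, 0.5)` returns a value with at least half of the data at or below it.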
> Implement an online histogram with Merging and equalization features
> --------------------------------------------------------------------
>
> Key: FLINK-2030
> URL: https://issues.apache.org/jira/browse/FLINK-2030
> Project: Flink
> Issue Type: Sub-task
> Components: Machine Learning Library
> Reporter: Sachin Goel
> Assignee: Sachin Goel
> Priority: Minor
> Labels: ML
>
> For the implementation of the decision tree in
> https://issues.apache.org/jira/browse/FLINK-1727, we need to implement a
> histogram with online updates, merging and equalization features. A reference
> implementation is provided in [1].
> [1] http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf