Github user myui commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/52#discussion_r103403546
  
    --- Diff: docs/gitbook/eval/auc.md ---
    @@ -0,0 +1,102 @@
    +<!--
    +  Licensed to the Apache Software Foundation (ASF) under one
    +  or more contributor license agreements.  See the NOTICE file
    +  distributed with this work for additional information
    +  regarding copyright ownership.  The ASF licenses this file
    +  to you under the Apache License, Version 2.0 (the
    +  "License"); you may not use this file except in compliance
    +  with the License.  You may obtain a copy of the License at
    +
    +    http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing,
    +  software distributed under the License is distributed on an
    +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +  KIND, either express or implied.  See the License for the
    +  specific language governing permissions and limitations
    +  under the License.
    +-->
    +
    +<!-- toc -->
    +
    +# Area Under the ROC Curve
    +
    +[ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and Area Under the ROC Curve (AUC) are widely used metrics for binary (i.e., positive or negative) classification problems such as [Logistic Regression](../binaryclass/a9a_lr.md).
    +
    +Binary classifiers generally predict how likely a sample is to be positive by computing a probability. Ultimately, we can evaluate the classifiers by comparing the probabilities with the ground-truth positive/negative labels.
    +
    +Suppose there is a table that contains predicted scores (i.e., probabilities) and truth labels as follows:
    +
    +| probability<br/>(predicted score) | truth label |
    +|:---:|:---:|
    +| 0.5 | 0 |
    +| 0.3 | 1 |
    +| 0.2 | 0 |
    +| 0.8 | 1 |
    +| 0.7 | 1 |
    +
    +Once the rows are sorted by probability in descending order, AUC gives a metric based on how many positive (`label=1`) samples are ranked higher than negative (`label=0`) samples. If many positive rows get larger scores than negative rows, AUC is large, which indicates that the classifier performs well.
    +
    +# Compute AUC on Hivemall
    +
    +In Hivemall, the function `auc(double score, int label)` computes AUC for pairs of probability and truth label.
    +
    +For instance, the following query computes AUC for the table shown above:
    +
    +```sql
    +with data as (
    +  select 0.5 as prob, 0 as label
    +  union all
    +  select 0.3 as prob, 1 as label
    +  union all
    +  select 0.2 as prob, 0 as label
    +  union all
    +  select 0.8 as prob, 1 as label
    +  union all
    +  select 0.7 as prob, 1 as label
    +), data_ordered as (
    +  select prob, label
    +  from data
    +  order by prob desc
    +)
    +select auc(prob, label)
    +from data_ordered;
    +```
    +
    +This query returns `0.83333` as AUC.
    +
    +Since AUC is a metric based on ranked probability-label pairs as mentioned above, the input rows need to be ordered by score in descending order.
    +
    +Meanwhile, Hive's `distribute by` clause allows you to compute AUC in 
parallel: 
    +
    +```sql
    +with data as (
    +  select 0.5 as prob, 0 as label
    +  union all
    +  select 0.3 as prob, 1 as label
    +  union all
    +  select 0.2 as prob, 0 as label
    +  union all
    +  select 0.8 as prob, 1 as label
    +  union all
    +  select 0.7 as prob, 1 as label
    +), data_ordered as (
    +  select prob, label
    +  from data
    +  order by prob desc
    +)
    +select auc(prob, label)
    +from (
    +  select prob, label
    +  from data_ordered
    +  distribute by floor(prob / 0.2)
    +) t;
    +```
    +
    --- End diff --
    
    Add a note explaining what `floor(prob / 0.2)` means: it distributes the AUC computation into 5 bins.
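
To make the binning concrete, here is a minimal Python sketch (not Hivemall code) that evaluates the same key expression on the five sample rows from the diff. The helper `bin_id` and the constant `BIN_WIDTH` are hypothetical names used only for illustration. The sketch shows only which rows share a key; how Hive maps keys to reducers is not modelled.

```python
import math
from collections import defaultdict

# Sample (probability, label) rows from the table in the diff.
rows = [(0.5, 0), (0.3, 1), (0.2, 0), (0.8, 1), (0.7, 1)]

BIN_WIDTH = 0.2  # same constant as in `floor(prob / 0.2)`


def bin_id(prob, width=BIN_WIDTH):
    """Mimic the key expression `floor(prob / 0.2)` for a single row."""
    return math.floor(prob / width)


# Group rows by key; `distribute by` sends rows with the same key
# to the same reducer.
bins = defaultdict(list)
for prob, label in rows:
    bins[bin_id(prob)].append((prob, label))

for key in sorted(bins):
    print(key, bins[key])
```

With this data, `0.3` and `0.2` share key 1, while `0.5`, `0.7`, and `0.8` fall under keys 2, 3, and 4. Probabilities in `[0, 1)` yield keys 0 through 4, which gives the 5 bins; a probability of exactly `1.0` would yield key 5, so the note may want to mention that edge case.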

