[GitHub] incubator-hivemall pull request #52: [HIVEMALL-78] Implement AUC UDAF for bi...

2017-02-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/52#discussion_r103403546
  
--- Diff: docs/gitbook/eval/auc.md ---
@@ -0,0 +1,102 @@
+
+
+
+
+# Area Under the ROC Curve
+
+[ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and Area Under the ROC Curve (AUC) are widely used metrics for binary (i.e., positive or negative) classification problems such as [Logistic Regression](../binaryclass/a9a_lr.md).
+
+Binary classifiers generally predict how likely a sample is to be positive by computing a probability. We can therefore evaluate a classifier by comparing its predicted probabilities with the true positive/negative labels.
+
+Assume there is a table that contains predicted scores (i.e., probabilities) and truth labels as follows:
+
+| probability(predicted score) | truth label |
+|:---:|:---:|
+| 0.5 | 0 |
+| 0.3 | 1 |
+| 0.2 | 0 |
+| 0.8 | 1 |
+| 0.7 | 1 |
+
+Once the rows are sorted by probability in descending order, AUC gives a metric based on how many positive (`label=1`) samples are ranked higher than negative (`label=0`) samples. If many positive rows get larger scores than negative rows, AUC is large, which indicates that the classifier performs well.
+
+# Compute AUC on Hivemall
+
+On Hivemall, the function `auc(double score, int label)` computes AUC from pairs of a predicted probability and a truth label.
+
+For instance, the following query computes AUC for the table shown above:
+
+```sql
+with data as (
+  select 0.5 as prob, 0 as label
+  union all
+  select 0.3 as prob, 1 as label
+  union all
+  select 0.2 as prob, 0 as label
+  union all
+  select 0.8 as prob, 1 as label
+  union all
+  select 0.7 as prob, 1 as label
+), data_ordered as (
+  select prob, label
+  from data
+  order by prob desc
+)
+select auc(prob, label)
+from data_ordered;
+```
+
+This query returns `0.83333` as AUC.
+
+Since AUC is a metric based on ranked probability-label pairs, as mentioned above, the input rows need to be ordered by score in descending order.
+
+Meanwhile, Hive's `distribute by` clause allows you to compute AUC in parallel:
+
+```sql
+with data as (
+  select 0.5 as prob, 0 as label
+  union all
+  select 0.3 as prob, 1 as label
+  union all
+  select 0.2 as prob, 0 as label
+  union all
+  select 0.8 as prob, 1 as label
+  union all
+  select 0.7 as prob, 1 as label
+), data_ordered as (
+  select prob, label
+  from data
+  order by prob desc
+)
+select auc(prob, label)
+from (
+  select prob, label
+  from data_ordered
+  distribute by floor(prob / 0.2)
+) t;
+```
+
--- End diff --

Add a note explaining what `floor(prob / 0.2)` means: it distributes the AUC 
computation into 5 bins.
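
To illustrate what such a note could say, here is a small Python sketch that mimics the `distribute by floor(prob / 0.2)` key. It is not part of the patch; `bin_rows` and `width` are hypothetical names used only for this illustration. With a width of 0.2, probabilities in [0, 1) map to keys 0 through 4, i.e., 5 bins (a probability of exactly 1.0 would map to key 5).

```python
import math
from collections import defaultdict


def bin_rows(rows, width=0.2):
    """Group (prob, label) rows by the key floor(prob / width)."""
    bins = defaultdict(list)
    for prob, label in rows:
        bins[math.floor(prob / width)].append((prob, label))
    return dict(bins)


# Rows from the example table, already sorted by prob in descending order.
rows = [(0.8, 1), (0.7, 1), (0.5, 0), (0.3, 1), (0.2, 0)]
bins = bin_rows(rows)
for key in sorted(bins, reverse=True):
    print(key, bins[key])
```

Rows that share a key go to the same reducer, so each bin covers one contiguous range of scores.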


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] incubator-hivemall pull request #52: [HIVEMALL-78] Implement AUC UDAF for bi...

2017-02-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/52#discussion_r103402140
  
--- Diff: core/src/main/java/hivemall/evaluation/AUCUDAF.java ---
@@ -49,35 +51,264 @@
 @SuppressWarnings("deprecation")
 @Description(
 name = "auc",
-value = "_FUNC_(array rankItems, array correctItems [, const int 
recommendSize = rankItems.size])"
+value = "_FUNC_(array rankItems | double score, array correctItems 
| int label "
++ "[, const int recommendSize = rankItems.size ])"
 + " - Returns AUC")
 public final class AUCUDAF extends AbstractGenericUDAFResolver {
 
-// prevent instantiation
-private AUCUDAF() {}
-
 @Override
 public GenericUDAFEvaluator getEvaluator(@Nonnull TypeInfo[] typeInfo) 
throws SemanticException {
 if (typeInfo.length != 2 && typeInfo.length != 3) {
 throw new UDFArgumentTypeException(typeInfo.length - 1,
 "_FUNC_ takes two or three arguments");
 }
 
-ListTypeInfo arg1type = HiveUtils.asListTypeInfo(typeInfo[0]);
-if 
(!HiveUtils.isPrimitiveTypeInfo(arg1type.getListElementTypeInfo())) {
-throw new UDFArgumentTypeException(0,
-"The first argument `array rankItems` is invalid form: " + 
typeInfo[0]);
+if (HiveUtils.isNumberTypeInfo(typeInfo[0]) && 
HiveUtils.isIntegerTypeInfo(typeInfo[1])) {
+return new ClassificationEvaluator();
+} else {
+ListTypeInfo arg1type = HiveUtils.asListTypeInfo(typeInfo[0]);
+if 
(!HiveUtils.isPrimitiveTypeInfo(arg1type.getListElementTypeInfo())) {
+throw new UDFArgumentTypeException(0,
+"The first argument `array rankItems` is invalid form: 
" + typeInfo[0]);
+}
+
+ListTypeInfo arg2type = HiveUtils.asListTypeInfo(typeInfo[1]);
+if 
(!HiveUtils.isPrimitiveTypeInfo(arg2type.getListElementTypeInfo())) {
+throw new UDFArgumentTypeException(1,
+"The second argument `array correctItems` is invalid 
form: " + typeInfo[1]);
+}
+
+return new RankingEvaluator();
+}
+}
+
+public static class ClassificationEvaluator extends 
GenericUDAFEvaluator {
+
+private PrimitiveObjectInspector scoreOI;
+private PrimitiveObjectInspector labelOI;
+
+private StructObjectInspector internalMergeOI;
+private StructField aField;
+private StructField scorePrevField;
+private StructField fpField;
+private StructField tpField;
+private StructField fpPrevField;
+private StructField tpPrevField;
+
+public ClassificationEvaluator() {}
+
+@Override
+public ObjectInspector init(Mode mode, ObjectInspector[] 
parameters) throws HiveException {
+assert (parameters.length == 2 || parameters.length == 3) : 
parameters.length;
+super.init(mode, parameters);
+
+// initialize input
+if (mode == Mode.PARTIAL1 || mode == Mode.COMPLETE) {// from 
original data
+this.scoreOI = 
HiveUtils.asDoubleCompatibleOI(parameters[0]);
+this.labelOI = HiveUtils.asIntegerOI(parameters[1]);
+} else {// from partial aggregation
+StructObjectInspector soi = (StructObjectInspector) 
parameters[0];
+this.internalMergeOI = soi;
+this.aField = soi.getStructFieldRef("a");
+this.scorePrevField = soi.getStructFieldRef("scorePrev");
+this.fpField = soi.getStructFieldRef("fp");
+this.tpField = soi.getStructFieldRef("tp");
+this.fpPrevField = soi.getStructFieldRef("fpPrev");
+this.tpPrevField = soi.getStructFieldRef("tpPrev");
+}
+
+// initialize output
+final ObjectInspector outputOI;
+if (mode == Mode.PARTIAL1 || mode == Mode.PARTIAL2) {// 
terminatePartial
+outputOI = internalMergeOI();
+} else {// terminate
+outputOI = 
PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
+}
+return outputOI;
+}
+
+private static StructObjectInspector internalMergeOI() {
+ArrayList<String> fieldNames = new ArrayList<String>();
+ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
+
+fieldNames.add("a");
+fieldOIs.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
+

[GitHub] incubator-hivemall pull request #52: [HIVEMALL-78] Implement AUC UDAF for bi...

2017-02-28 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/52#discussion_r103402767
  
--- Diff: docs/gitbook/eval/auc.md ---
@@ -0,0 +1,102 @@
+
+
+
+
+# Area Under the ROC Curve
+
+[ROC curve](https://en.wikipedia.org/wiki/Receiver_operating_characteristic) and Area Under the ROC Curve (AUC) are widely used metrics for binary (i.e., positive or negative) classification problems such as [Logistic Regression](../binaryclass/a9a_lr.md).
--- End diff --

Fix the link: it should point to `../binaryclass/a9a_lr.html`.




[GitHub] incubator-hivemall pull request #52: [HIVEMALL-78] Implement AUC UDAF for bi...

2017-02-28 Thread takuti
Github user takuti commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/52#discussion_r103401513
  
--- Diff: core/src/main/java/hivemall/evaluation/AUCUDAF.java ---
@@ -49,35 +50,251 @@
 @SuppressWarnings("deprecation")
 @Description(
 name = "auc",
-value = "_FUNC_(array rankItems, array correctItems [, const int 
recommendSize = rankItems.size])"
+value = "_FUNC_(array rankItems | double score, array correctItems 
| double label "
++ "[, const int recommendSize = rankItems.size ])"
 + " - Returns AUC")
 public final class AUCUDAF extends AbstractGenericUDAFResolver {
 
-// prevent instantiation
-private AUCUDAF() {}
-
 @Override
 public GenericUDAFEvaluator getEvaluator(@Nonnull TypeInfo[] typeInfo) 
throws SemanticException {
 if (typeInfo.length != 2 && typeInfo.length != 3) {
 throw new UDFArgumentTypeException(typeInfo.length - 1,
 "_FUNC_ takes two or three arguments");
 }
 
-ListTypeInfo arg1type = HiveUtils.asListTypeInfo(typeInfo[0]);
-if 
(!HiveUtils.isPrimitiveTypeInfo(arg1type.getListElementTypeInfo())) {
-throw new UDFArgumentTypeException(0,
-"The first argument `array rankItems` is invalid form: " + 
typeInfo[0]);
+if (HiveUtils.isNumberTypeInfo(typeInfo[0]) && 
HiveUtils.isNumberTypeInfo(typeInfo[1])) {
+return new ClassificationEvaluator();
+} else {
+ListTypeInfo arg1type = HiveUtils.asListTypeInfo(typeInfo[0]);
+if 
(!HiveUtils.isPrimitiveTypeInfo(arg1type.getListElementTypeInfo())) {
+throw new UDFArgumentTypeException(0,
+"The first argument `array rankItems` is invalid form: 
" + typeInfo[0]);
+}
+
+ListTypeInfo arg2type = HiveUtils.asListTypeInfo(typeInfo[1]);
+if 
(!HiveUtils.isPrimitiveTypeInfo(arg2type.getListElementTypeInfo())) {
+throw new UDFArgumentTypeException(1,
+"The second argument `array correctItems` is invalid 
form: " + typeInfo[1]);
+}
+
+return new RankingEvaluator();
+}
+}
+
+public static class ClassificationEvaluator extends 
GenericUDAFEvaluator {
+
+private PrimitiveObjectInspector scoreOI;
+private PrimitiveObjectInspector labelOI;
+
+private StructObjectInspector internalMergeOI;
+private StructField aField;
+private StructField scorePrevField;
+private StructField fpField;
+private StructField tpField;
+private StructField fpPrevField;
+private StructField tpPrevField;
+
+public ClassificationEvaluator() {}
+
+@Override
+public ObjectInspector init(Mode mode, ObjectInspector[] 
parameters) throws HiveException {
+assert (parameters.length == 2 || parameters.length == 3) : 
parameters.length;
+super.init(mode, parameters);
+
+// initialize input
+if (mode == Mode.PARTIAL1 || mode == Mode.COMPLETE) {// from 
original data
+this.scoreOI = (PrimitiveObjectInspector) parameters[0];
+this.labelOI = (PrimitiveObjectInspector) parameters[1];
+} else {// from partial aggregation
+StructObjectInspector soi = (StructObjectInspector) 
parameters[0];
+this.internalMergeOI = soi;
+this.aField = soi.getStructFieldRef("a");
+this.scorePrevField = soi.getStructFieldRef("scorePrev");
+this.fpField = soi.getStructFieldRef("fp");
+this.tpField = soi.getStructFieldRef("tp");
+this.fpPrevField = soi.getStructFieldRef("fpPrev");
+this.tpPrevField = soi.getStructFieldRef("tpPrev");
+}
+
+// initialize output
+final ObjectInspector outputOI;
+if (mode == Mode.PARTIAL1 || mode == Mode.PARTIAL2) {// 
terminatePartial
+outputOI = internalMergeOI();
+} else {// terminate
+outputOI = 
PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
+}
+return outputOI;
+}
+
+private static StructObjectInspector internalMergeOI() {
+ArrayList<String> fieldNames = new ArrayList<String>();
+ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
+
+fieldNames.add("a");
+fieldOIs.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
+

[GitHub] incubator-hivemall pull request #52: [HIVEMALL-78] Implement AUC UDAF for bi...

2017-02-27 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/52#discussion_r103388346
  
--- Diff: core/src/main/java/hivemall/evaluation/AUCUDAF.java ---
@@ -49,35 +50,251 @@
 @SuppressWarnings("deprecation")
 @Description(
 name = "auc",
-value = "_FUNC_(array rankItems, array correctItems [, const int 
recommendSize = rankItems.size])"
+value = "_FUNC_(array rankItems | double score, array correctItems 
| double label "
++ "[, const int recommendSize = rankItems.size ])"
 + " - Returns AUC")
 public final class AUCUDAF extends AbstractGenericUDAFResolver {
 
-// prevent instantiation
-private AUCUDAF() {}
-
 @Override
 public GenericUDAFEvaluator getEvaluator(@Nonnull TypeInfo[] typeInfo) 
throws SemanticException {
 if (typeInfo.length != 2 && typeInfo.length != 3) {
 throw new UDFArgumentTypeException(typeInfo.length - 1,
 "_FUNC_ takes two or three arguments");
 }
 
-ListTypeInfo arg1type = HiveUtils.asListTypeInfo(typeInfo[0]);
-if 
(!HiveUtils.isPrimitiveTypeInfo(arg1type.getListElementTypeInfo())) {
-throw new UDFArgumentTypeException(0,
-"The first argument `array rankItems` is invalid form: " + 
typeInfo[0]);
+if (HiveUtils.isNumberTypeInfo(typeInfo[0]) && 
HiveUtils.isNumberTypeInfo(typeInfo[1])) {
--- End diff --

Use `&& HiveUtils.isIntegerTypeInfo(typeInfo[1])` here, since the label argument should be an integer.




[GitHub] incubator-hivemall pull request #52: [HIVEMALL-78] Implement AUC UDAF for bi...

2017-02-27 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/52#discussion_r103387903
  
--- Diff: core/src/main/java/hivemall/evaluation/AUCUDAF.java ---
@@ -49,35 +50,251 @@
 @SuppressWarnings("deprecation")
 @Description(
 name = "auc",
-value = "_FUNC_(array rankItems, array correctItems [, const int 
recommendSize = rankItems.size])"
+value = "_FUNC_(array rankItems | double score, array correctItems 
| double label "
++ "[, const int recommendSize = rankItems.size ])"
 + " - Returns AUC")
 public final class AUCUDAF extends AbstractGenericUDAFResolver {
 
-// prevent instantiation
-private AUCUDAF() {}
-
 @Override
 public GenericUDAFEvaluator getEvaluator(@Nonnull TypeInfo[] typeInfo) 
throws SemanticException {
 if (typeInfo.length != 2 && typeInfo.length != 3) {
 throw new UDFArgumentTypeException(typeInfo.length - 1,
 "_FUNC_ takes two or three arguments");
 }
 
-ListTypeInfo arg1type = HiveUtils.asListTypeInfo(typeInfo[0]);
-if 
(!HiveUtils.isPrimitiveTypeInfo(arg1type.getListElementTypeInfo())) {
-throw new UDFArgumentTypeException(0,
-"The first argument `array rankItems` is invalid form: " + 
typeInfo[0]);
+if (HiveUtils.isNumberTypeInfo(typeInfo[0]) && 
HiveUtils.isNumberTypeInfo(typeInfo[1])) {
+return new ClassificationEvaluator();
+} else {
+ListTypeInfo arg1type = HiveUtils.asListTypeInfo(typeInfo[0]);
+if 
(!HiveUtils.isPrimitiveTypeInfo(arg1type.getListElementTypeInfo())) {
+throw new UDFArgumentTypeException(0,
+"The first argument `array rankItems` is invalid form: 
" + typeInfo[0]);
+}
+
+ListTypeInfo arg2type = HiveUtils.asListTypeInfo(typeInfo[1]);
+if 
(!HiveUtils.isPrimitiveTypeInfo(arg2type.getListElementTypeInfo())) {
+throw new UDFArgumentTypeException(1,
+"The second argument `array correctItems` is invalid 
form: " + typeInfo[1]);
+}
+
+return new RankingEvaluator();
+}
+}
+
+public static class ClassificationEvaluator extends 
GenericUDAFEvaluator {
+
+private PrimitiveObjectInspector scoreOI;
+private PrimitiveObjectInspector labelOI;
+
+private StructObjectInspector internalMergeOI;
+private StructField aField;
+private StructField scorePrevField;
+private StructField fpField;
+private StructField tpField;
+private StructField fpPrevField;
+private StructField tpPrevField;
+
+public ClassificationEvaluator() {}
+
+@Override
+public ObjectInspector init(Mode mode, ObjectInspector[] 
parameters) throws HiveException {
+assert (parameters.length == 2 || parameters.length == 3) : 
parameters.length;
+super.init(mode, parameters);
+
+// initialize input
+if (mode == Mode.PARTIAL1 || mode == Mode.COMPLETE) {// from 
original data
+this.scoreOI = (PrimitiveObjectInspector) parameters[0];
+this.labelOI = (PrimitiveObjectInspector) parameters[1];
+} else {// from partial aggregation
+StructObjectInspector soi = (StructObjectInspector) 
parameters[0];
+this.internalMergeOI = soi;
+this.aField = soi.getStructFieldRef("a");
+this.scorePrevField = soi.getStructFieldRef("scorePrev");
+this.fpField = soi.getStructFieldRef("fp");
+this.tpField = soi.getStructFieldRef("tp");
+this.fpPrevField = soi.getStructFieldRef("fpPrev");
+this.tpPrevField = soi.getStructFieldRef("tpPrev");
+}
+
+// initialize output
+final ObjectInspector outputOI;
+if (mode == Mode.PARTIAL1 || mode == Mode.PARTIAL2) {// 
terminatePartial
+outputOI = internalMergeOI();
+} else {// terminate
+outputOI = 
PrimitiveObjectInspectorFactory.writableDoubleObjectInspector;
+}
+return outputOI;
+}
+
+private static StructObjectInspector internalMergeOI() {
+ArrayList<String> fieldNames = new ArrayList<String>();
+ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
+
+fieldNames.add("a");
+fieldOIs.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
+

[GitHub] incubator-hivemall pull request #52: [HIVEMALL-78] Implement AUC UDAF for bi...

2017-02-27 Thread myui
Github user myui commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/52#discussion_r103386937
  
--- Diff: core/src/test/java/hivemall/evaluation/AUCUDAFTest.java ---
@@ -0,0 +1,218 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ *   http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing,
+ * software distributed under the License is distributed on an
+ * "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+ * KIND, either express or implied.  See the License for the
+ * specific language governing permissions and limitations
+ * under the License.
+ */
+package hivemall.evaluation;
+
+import org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator;
+import 
org.apache.hadoop.hive.ql.udf.generic.SimpleGenericUDAFParameterInfo;
+import org.apache.hadoop.hive.serde2.objectinspector.ObjectInspector;
+import 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorFactory;
+import 
org.apache.hadoop.hive.serde2.objectinspector.PrimitiveObjectInspector;
+import 
org.apache.hadoop.hive.serde2.objectinspector.primitive.PrimitiveObjectInspectorFactory;
+
+import java.util.ArrayList;
+import java.util.Arrays;
+
+import org.junit.Assert;
+import org.junit.Before;
+import org.junit.Test;
+
+public class AUCUDAFTest {
+AUCUDAF auc;
+GenericUDAFEvaluator evaluator;
+ObjectInspector[] inputOIs;
+ObjectInspector[] partialOI;
+AUCUDAF.ClassificationAUCAggregationBuffer agg;
+
+@Before
+public void setUp() throws Exception {
+auc = new AUCUDAF();
+
+inputOIs = new ObjectInspector[] {
+
PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(
+PrimitiveObjectInspector.PrimitiveCategory.DOUBLE),
+
PrimitiveObjectInspectorFactory.getPrimitiveJavaObjectInspector(
+
PrimitiveObjectInspector.PrimitiveCategory.DOUBLE)};
+
+evaluator = auc.getEvaluator(new 
SimpleGenericUDAFParameterInfo(inputOIs, false, false));
+
+ArrayList<String> fieldNames = new ArrayList<String>();
+ArrayList<ObjectInspector> fieldOIs = new ArrayList<ObjectInspector>();
+fieldNames.add("a");
+fieldOIs.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
+fieldNames.add("scorePrev");
+fieldOIs.add(PrimitiveObjectInspectorFactory.writableDoubleObjectInspector);
+fieldNames.add("fp");
+fieldOIs.add(PrimitiveObjectInspectorFactory.writableLongObjectInspector);
+fieldNames.add("tp");
+fieldOIs.add(PrimitiveObjectInspectorFactory.writableLongObjectInspector);
+fieldNames.add("fpPrev");
+fieldOIs.add(PrimitiveObjectInspectorFactory.writableLongObjectInspector);
+fieldNames.add("tpPrev");
+fieldOIs.add(PrimitiveObjectInspectorFactory.writableLongObjectInspector);
+
+partialOI = new ObjectInspector[2];
+partialOI[0] = 
ObjectInspectorFactory.getStandardStructObjectInspector(fieldNames, fieldOIs);
+
+agg = (AUCUDAF.ClassificationAUCAggregationBuffer) 
evaluator.getNewAggregationBuffer();
+}
+
+@Test
+public void test() throws Exception {
+// should be sorted by scores in a descending order
+final double[] scores = new double[] {0.8, 0.7, 0.5, 0.3, 0.2};
+final double[] labels = new double[] {1, 1, 0, 1, 0};
+
+evaluator.init(GenericUDAFEvaluator.Mode.PARTIAL1, inputOIs);
+evaluator.reset(agg);
+
+for (int i = 0; i < scores.length; i++) {
+evaluator.iterate(agg, new Object[] {scores[i], labels[i]});
+}
+
+Assert.assertEquals(0.83333, agg.get(), 1e-5);
+}
+
+@Test
+public void testAllTruePositive() throws Exception {
+final double[] scores = new double[] {0.8, 0.7, 0.5, 0.3, 0.2};
+final double[] labels = new double[] {1, 1, 1, 1, 1};
+
+evaluator.init(GenericUDAFEvaluator.Mode.PARTIAL1, inputOIs);
+evaluator.reset(agg);
+
+for (int i = 0; i < scores.length; i++) {
+evaluator.iterate(agg, new Object[] {scores[i], labels[i]});
+}
+
+// AUC for all TP scores 

[GitHub] incubator-hivemall pull request #52: [HIVEMALL-78] Implement AUC UDAF for bi...

2017-02-27 Thread takuti
GitHub user takuti opened a pull request:

https://github.com/apache/incubator-hivemall/pull/52

[HIVEMALL-78] Implement AUC UDAF for binary classification

## What changes were proposed in this pull request?

In addition to the existing `auc(array, array)` for ranking (myui/hivemall#326), 
this patch adds `auc(double, double)` for binary classification.

## What type of PR is it?

Feature

## What is the Jira issue?

https://issues.apache.org/jira/browse/HIVEMALL-78

## How was this patch tested?

I created a unit test for the UDAF, and it passes:

```
$ mvn -Dtest=hivemall.evaluation.AUCUDAFTest test
```

I also ran manual tests with the following queries:

```sql
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
), data_ordered as (
  select prob, label
  from data
  order by prob desc
)
select auc(prob, label)
from (
  select prob, label
  from data_ordered
  distribute by floor(prob / 0.2)
) t;
```

```sql
with data as (
  select 0.5 as prob, 0 as label
  union all
  select 0.3 as prob, 1 as label
  union all
  select 0.2 as prob, 0 as label
  union all
  select 0.8 as prob, 1 as label
  union all
  select 0.7 as prob, 1 as label
), data_ordered as (
  select prob, label
  from data
  order by prob desc
)
select auc(prob, label)
from data_ordered;
```

Both returned `AUC=0.83333`. This result matches [scikit-learn's 
roc_auc_score()](http://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html):

```
>>> roc_auc_score([0,1,0,1,1],[0.5,0.3,0.2,0.8,0.7])
0.83333333333333326
```
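
As an independent cross-check that needs no library, AUC can also be computed as the fraction of (positive, negative) pairs in which the positive sample gets the higher score, counting ties as half. The sketch below is only an illustration of that definition, not code from this patch; `pairwise_auc` is a hypothetical helper name.

```python
from itertools import product


def pairwise_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p, n in product(pos, neg):
        if p > n:
            wins += 1.0   # positive ranked above negative
        elif p == n:
            wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Same inputs as the roc_auc_score call above.
auc_value = pairwise_auc([0, 1, 0, 1, 1], [0.5, 0.3, 0.2, 0.8, 0.7])
print(auc_value)
```

For this data, five of the six positive/negative pairs are ordered correctly (only 0.3 vs. 0.5 is not), so the value is 5/6.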

## How to use this feature?

See the queries above. Input rows must be ordered by score in descending 
order.
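
To show why the ordering matters, here is a minimal Python sketch of a single-pass trapezoid sweep over (score, label) rows. The variable names echo the `a`, `scorePrev`, `fp`, `tp`, `fpPrev`, and `tpPrev` fields visible in the diff, but this is an assumption about the general technique, not a copy of the UDAF's actual logic; `sweep_auc` is a hypothetical name.

```python
def sweep_auc(rows):
    """Single-pass AUC over (score, label) rows sorted by score, descending."""
    area = 0.0
    tp = fp = tp_prev = fp_prev = 0
    score_prev = float("inf")
    for score, label in rows:
        if score != score_prev:
            # Close the trapezoid between the previous and current ROC points.
            area += abs(fp - fp_prev) * (tp + tp_prev) / 2.0
            score_prev = score
            fp_prev, tp_prev = fp, tp
        if label == 1:
            tp += 1
        else:
            fp += 1
    # Close the last trapezoid, ending at the point (fp, tp).
    area += abs(fp - fp_prev) * (tp + tp_prev) / 2.0
    if tp == 0 or fp == 0:
        raise ValueError("AUC needs at least one positive and one negative")
    # Normalize the area from raw counts to the unit square.
    return area / (tp * fp)


table = [(0.5, 0), (0.3, 1), (0.2, 0), (0.8, 1), (0.7, 1)]
auc_sorted = sweep_auc(sorted(table, key=lambda r: r[0], reverse=True))
auc_unsorted = sweep_auc(table)  # same rows, original table order
print(auc_sorted, auc_unsorted)
```

With the rows sorted in descending order the sweep yields 5/6; fed in the original table order it yields 1/6, which is why a sweep of this kind depends on the input order.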


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/takuti/incubator-hivemall auc

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/incubator-hivemall/pull/52.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #52


commit e60ff231e07aa515666ec7f4863ed1c8401e0e27
Author: Takuya Kitazawa 
Date:   2017-02-28T06:08:33Z

Implement AUCUDAF

commit 4756f463700740af0bd51ab7a25e383649a2d504
Author: Takuya Kitazawa 
Date:   2017-02-28T06:09:18Z

Add unit test of AUCUDAF for classification



