Github user chezou commented on a diff in the pull request:

https://github.com/apache/incubator-hivemall/pull/158#discussion_r214233539

--- Diff: docs/gitbook/supervised_learning/tutorial.md ---
@@ -0,0 +1,461 @@
+<!--
+  Licensed to the Apache Software Foundation (ASF) under one
+  or more contributor license agreements. See the NOTICE file
+  distributed with this work for additional information
+  regarding copyright ownership. The ASF licenses this file
+  to you under the Apache License, Version 2.0 (the
+  "License"); you may not use this file except in compliance
+  with the License. You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+  Unless required by applicable law or agreed to in writing,
+  software distributed under the License is distributed on an
+  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+  KIND, either express or implied. See the License for the
+  specific language governing permissions and limitations
+  under the License.
+-->
+
+# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
+
+<!-- toc -->
+
+## What is Hivemall?
+
+[Apache Hive](https://hive.apache.org/) is a data warehousing solution that enables us to process large-scale data easily using SQL-like queries. Assume that you have a table named `purchase_history`, which can be created artificially as follows:
+
+```sql
+create table if not exists purchase_history as
+select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, "book" as category, 1 as label
+union all
+select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as price, "sports" as category, 0 as label
+union all
+select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as price, "entertainment" as category, 0 as label
+union all
+select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, "food" as category, 0 as label
+union all
+select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as price, "electronics" as category, 1 as label
+;
+```
+
+The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
+
+```sql
+select count(1) from purchase_history;
+```
+
+> 5
+
+[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a collection of user-defined functions (UDFs) for HiveQL that is strongly optimized for machine learning (ML) and data science. For example, you can efficiently build a logistic regression model using stochastic gradient descent (SGD) optimization by issuing the following query of about 10 lines:
+
+```sql
+SELECT
+  train_classifier(
+    features,
+    label,
+    '-loss_function logloss -optimizer SGD'
+  ) as (feature, weight)
+FROM
+  training
+;
+```
+
+
+The Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows the current Hivemall version, for example:
+
+```sql
+select hivemall_version();
+```
+
+> "0.5.1-incubating-SNAPSHOT"
+
+Below we list ML and related problems that Hivemall can solve:
+
+- [Binary and multi-class classification](../binaryclass/general.html)
+- [Regression](../regression/general.html)
+- [Recommendation](../recommend/cf.html)
+- [Anomaly detection](../anomaly/lof.html)
+- [Natural language processing](../misc/tokenizer.html)
+- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
+- [Data sketching](../misc/funcs.html#sketching)
+- Evaluation
+
+Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) gives a helpful overview of Hivemall.
+
+This tutorial explains the basic usage of Hivemall through two supervised learning examples: a simple regressor and a binary classifier.
+
+## Binary classification
+
+Imagine a scenario in which we would like to build a binary classifier from the mock `purchase_history` data and predict unseen purchases in order to run a new campaign effectively:
+
+| day\_of\_week | gender | price | category | label |
+|:---:|:---:|:---:|:---:|:---|
+|Saturday | male | 600 | book | 1 |
+|Friday | female | 4800 | sports | 0 |
+|Friday | other | 18000 | entertainment | 0 |
+|Thursday | male | 200 | food | 0 |
+|Wednesday | female | 1000 | electronics | 1 |
+
+Use the Hivemall [`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle the problem as follows.
+
+### Step 1. Feature representation
+
+First of all, we have to convert the records into pairs of a feature vector and its corresponding target value. Here, Hivemall requires you to represent input features in a specific format.
+
+To be more precise, Hivemall represents a single feature as a concatenation of an **index** (i.e., **name**) and its **value**:
+
+- Quantitative feature: `<index>:<value>`
+  - e.g., `price:600.0`
+- Categorical feature: `<index>#<value>`
+  - e.g., `gender#male`
+
--- End diff --

Added 0f593c4
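
For reference, the `<index>:<value>` and `<index>#<value>` strings described in Step 1 of the quoted diff can be assembled with ordinary Hive built-ins. The query below is a minimal sketch, assuming the `purchase_history` table from the diff exists; the column alias `features` is an illustrative choice, and the tutorial itself may rely on dedicated Hivemall helper functions rather than manual `concat` calls.

```sql
-- Sketch only: builds Hivemall-style feature strings using Hive built-ins.
-- The alias `features` is an assumption made for illustration.
select
  id,
  array(
    -- quantitative feature: <index>:<value>
    concat('price:', cast(cast(price as double) as string)),
    -- categorical features: <index>#<value>
    concat('day_of_week#', day_of_week),
    concat('gender#', gender),
    concat('category#', category)
  ) as features,
  label
from
  purchase_history
;
```

Each row then carries an array of feature strings next to its `label`, which is the shape of input that the `train_classifier()` call quoted in the diff reads from its `features` and `label` columns.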
---