GitHub user chezou commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/158#discussion_r214233539
  
    --- Diff: docs/gitbook/supervised_learning/tutorial.md ---
    @@ -0,0 +1,461 @@
    +<!--
    +  Licensed to the Apache Software Foundation (ASF) under one
    +  or more contributor license agreements.  See the NOTICE file
    +  distributed with this work for additional information
    +  regarding copyright ownership.  The ASF licenses this file
    +  to you under the Apache License, Version 2.0 (the
    +  "License"); you may not use this file except in compliance
    +  with the License.  You may obtain a copy of the License at
    +
    +    http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing,
    +  software distributed under the License is distributed on an
    +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +  KIND, either express or implied.  See the License for the
    +  specific language governing permissions and limitations
    +  under the License.
    +-->
    +
    +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
    +
    +<!-- toc -->
    +
    +## What is Hivemall?
    +
    +[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to easily process large-scale data with SQL-like queries. Assume 
that you have a table named `purchase_history`, which can be created 
artificially as follows:
    +
    +```sql
    +create table if not exists purchase_history as
    +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
    +union all
    +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
    +union all
    +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
    +union all
    +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
    +union all
    +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
    +;
    +```
    +
    +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
    +
    +```sql
    +select count(1) from purchase_history;
    +```
    +
    +> 5
    +
    +[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL that is strongly 
optimized for machine learning (ML) and data science. For example, you 
can efficiently build a logistic regression model with stochastic gradient 
descent (SGD) optimization by issuing the following ~10-line query:
    +
    +```sql
    +SELECT
    +  train_classifier(
    +    features,
    +    label,
    +    '-loss_function logloss -optimizer SGD'
    +  ) as (feature, weight)
    +FROM
    +  training
    +;
    +```
    +
    +
    +The Hivemall function [`hivemall_version()`](../misc/funcs.html#others) returns 
the current Hivemall version, for example:
    +
    +```sql
    +select hivemall_version();
    +```
    +
    +> "0.5.1-incubating-SNAPSHOT"
    +
    +Below we list ML and relevant problems that Hivemall can solve:
    +
    +- [Binary and multi-class classification](../binaryclass/general.html)
    +- [Regression](../regression/general.html)
    +- [Recommendation](../recommend/cf.html)
    +- [Anomaly detection](../anomaly/lof.html)
    +- [Natural language processing](../misc/tokenizer.html)
    +- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
    +- [Data sketching](../misc/funcs.html#sketching)
    +- Evaluation
    +
    +Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) 
gives a helpful overview of Hivemall.
    +
    +This tutorial explains the basic usage of Hivemall with examples of 
supervised learning of a simple regressor and a binary classifier.
    +
    +## Binary classification
    +
    +Imagine a scenario in which we would like to build a binary classifier from the mock 
`purchase_history` data and predict unseen purchases in order to run a new 
campaign effectively:
    +
    +| day\_of\_week | gender | price | category | label |
    +|:---:|:---:|:---:|:---:|:---|
    +|Saturday | male | 600 | book | 1 |
    +|Friday | female | 4800 | sports | 0 |
    +|Friday | other | 18000  | entertainment | 0 |
    +|Thursday | male | 200 | food | 0 |
    +|Wednesday | female | 1000 | electronics | 1 |
    +
    +Use the Hivemall 
[`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle 
the problem as follows.
    +
    +### Step 1. Feature representation
    +
    +First of all, we have to convert the records into pairs of a feature 
vector and its corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
    +
    +To be more precise, Hivemall represents a single feature as a concatenation 
of **index** (i.e., **name**) and its **value**:
    +
    +- Quantitative feature: `<index>:<value>`
    +  - e.g., `price:600.0`
    +- Categorical feature: `<index>#<value>`
    +  - e.g., `gender#male`
    +
    --- End diff --
    
    Added 0f593c4
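
    For illustration, the `<index>:<value>` and `<index>#<value>` formats quoted at the end of the diff could be produced from the mock `purchase_history` table roughly as below. This is a minimal sketch that assumes only the Hive built-ins `concat`, `cast`, and `array`; it is not necessarily how the tutorial itself builds features, and the choice of `price` and `gender` is just an example.

    ```sql
    -- Sketch only: build Hivemall-style feature strings with plain Hive built-ins.
    select
      id,
      array(
        concat('price:', cast(price as double)),  -- quantitative feature: <index>:<value>
        concat('gender#', gender)                  -- categorical feature:  <index>#<value>
      ) as features,
      label
    from
      purchase_history
    ;
    ```

    Each row then carries an array of feature strings plus its label, which is the shape of the `features` and `label` columns passed to `train_classifier()` in the query quoted earlier in the diff.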

