[GitHub] incubator-hivemall pull request #158: [HIVEMALL-215] Add step-by-step tutori...

chezou Thu, 30 Aug 2018 19:35:00 -0700

Github user chezou commented on a diff in the pull request:

    https://github.com/apache/incubator-hivemall/pull/158#discussion_r214231425
  
    --- Diff: docs/gitbook/supervised_learning/tutorial.md ---
    @@ -0,0 +1,461 @@
    +<!--
    +  Licensed to the Apache Software Foundation (ASF) under one
    +  or more contributor license agreements.  See the NOTICE file
    +  distributed with this work for additional information
    +  regarding copyright ownership.  The ASF licenses this file
    +  to you under the Apache License, Version 2.0 (the
    +  "License"); you may not use this file except in compliance
    +  with the License.  You may obtain a copy of the License at
    +
    +    http://www.apache.org/licenses/LICENSE-2.0
    +
    +  Unless required by applicable law or agreed to in writing,
    +  software distributed under the License is distributed on an
    +  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
    +  KIND, either express or implied.  See the License for the
    +  specific language governing permissions and limitations
    +  under the License.
    +-->
    +
    +# Step-by-Step Tutorial on Supervised Learning with Apache Hivemall
    +
    +<!-- toc -->
    +
    +## What is Hivemall?
    +
    +[Apache Hive](https://hive.apache.org/) is a data warehousing solution 
that enables us to process large-scale data in the form of SQL easily. Assume 
that you have a table named `purchase_history` which can be artificially 
created as:
    +
    +```sql
    +create table if not exists purchase_history as
    +select 1 as id, "Saturday" as day_of_week, "male" as gender, 600 as price, 
"book" as category, 1 as label
    +union all
    +select 2 as id, "Friday" as day_of_week, "female" as gender, 4800 as 
price, "sports" as category, 0 as label
    +union all
    +select 3 as id, "Friday" as day_of_week, "other" as gender, 18000 as 
price, "entertainment" as category, 0 as label
    +union all
    +select 4 as id, "Thursday" as day_of_week, "male" as gender, 200 as price, 
"food" as category, 0 as label
    +union all
    +select 5 as id, "Wednesday" as day_of_week, "female" as gender, 1000 as 
price, "electronics" as category, 1 as label
    +;
    +```
    +
    +The syntax of Hive queries, namely **HiveQL**, is very similar to SQL:
    +
    +```sql
    +select count(1) from purchase_history;
    +```
    +
    +> 5
    +
    +[Apache Hivemall](https://github.com/apache/incubator-hivemall) is a 
collection of user-defined functions (UDFs) for HiveQL which is strongly 
optimized for machine learning (ML) and data science. To give an example, you 
can efficiently build a logistic regression model with the stochastic gradient 
descent (SGD) optimization by issuing the following ~10 lines of query:
    +
    +```sql
    +SELECT
    +  train_classifier(
    +    features,
    +    label,
    +    '-loss_function logloss -optimizer SGD'
    +  ) as (feature, weight)
    +FROM
    +  training
    +;
    +```
    +
    +
    +The Hivemall function [`hivemall_version()`](../misc/funcs.html#others) shows the current Hivemall version, for example:
    +
    +```sql
    +select hivemall_version();
    +```
    +
    +> "0.5.1-incubating-SNAPSHOT"
    +
    +Below we list ML and relevant problems that Hivemall can solve:
    +
    +- [Binary and multi-class classification](../binaryclass/general.html)
    +- [Regression](../regression/general.html)
    +- [Recommendation](../recommend/cf.html)
    +- [Anomaly detection](../anomaly/lof.html)
    +- [Natural language processing](../misc/tokenizer.html)
    +- [Clustering](../misc/tokenizer.html) (i.e., topic modeling)
    +- [Data sketching](../misc/funcs.html#sketching)
    +- Evaluation
    +
    +Our [YouTube demo video](https://www.youtube.com/watch?v=cMUsuA9KZ_c) gives an overview of Hivemall.
    +
    +This tutorial explains the basic usage of Hivemall through supervised learning examples: a simple regressor and a binary classifier.
    +
    +## Binary classification
    +
    +Imagine a scenario in which we would like to build a binary classifier from the mock `purchase_history` data and predict unforeseen purchases in order to run a new campaign effectively:
    +
    +| day\_of\_week | gender | price | category | label |
    +|:---:|:---:|:---:|:---:|:---|
    +|Saturday | male | 600 | book | 1 |
    +|Friday | female | 4800 | sports | 0 |
    +|Friday | other | 18000  | entertainment | 0 |
    +|Thursday | male | 200 | food | 0 |
    +|Wednesday | female | 1000 | electronics | 1 |
    +
    +Use Hivemall's [`train_classifier()`](../misc/funcs.html#binary-classification) UDF to tackle the problem as follows.
    +
    +### Step 1. Feature representation
    +
    +First of all, we have to convert the records into pairs of the feature 
vector and corresponding target value. Here, Hivemall requires you to represent 
input features in a specific format.
    +
    +To be more precise, Hivemall represents a single feature as a concatenation of **index** (i.e., **name**) and its **value**:
    +
    +- Quantitative feature: `<index>:<value>`
    +  - e.g., `price:600.0`
    +- Categorical feature: `<index>#<value>`
    +  - e.g., `gender#male`
    +
    +Each of those features is a string value in Hive, and "feature vector" 
means an array of string values like:
    +
    +```
    +["price:600.0", "day of week#Saturday", "gender#male", "category#book"]
    +```
    +
    +See also the more detailed [documentation on the input format](../getting_started/input-format.html).
    +
    +Therefore, what we first need to do is to convert the records into an array of feature strings, and Hivemall functions [`quantitative_features()`](../getting_started/input-format.html#quantitative-features), [`categorical_features()`](../getting_started/input-format.html#categorical-features) and [`array_concat()`](../misc/generic_funcs.html#array) provide a simple way to create the pairs of feature vector and target value:
    +
    +```sql
    +create table if not exists training as
    +select
    +  id,
    +  array_concat( -- concatenate two arrays of quantitative and categorical features into a single array
    +    quantitative_features(
    +      array("price"), -- quantitative feature names
    +      price -- corresponding column names
    +    ),
    +    categorical_features(
    +      array("day of week", "gender", "category"), -- categorical feature names
    +      day_of_week, gender, category -- corresponding column names
    +    )
    +  ) as features,
    +  label
    +from
    +  purchase_history
    +;
    +```
    +
    +The training table is as follows:
    +
    +|id | features |  label |
    +|:---:|:---|:---|
    +|1 |["price:600.0","day of week#Saturday","gender#male","category#book"] | 1 |
    +|2 |["price:4800.0","day of week#Friday","gender#female","category#sports"] | 0 |
    +|3 |["price:18000.0","day of week#Friday","gender#other","category#entertainment"] | 0 |
    +|4 |["price:200.0","day of week#Thursday","gender#male","category#food"] | 0 |
    +|5 |["price:1000.0","day of week#Wednesday","gender#female","category#electronics"] | 1 |
    +
    +The output table `training` will be directly used as an input to 
Hivemall's ML functions in the next step.
    +
    +Note that you can apply extra Hivemall functions (e.g., 
[`rescale()`](../misc/funcs.html#feature-scaling), 
[`feature_hashing()`](../misc/funcs.html#feature-hashing), 
[`l1_normalize()`](../misc/funcs.html#feature-scaling)) for the features in 
this step to make your prediction model more accurate and stable; it is known 
as *feature engineering* in the context of ML. See our 
[documentation](../ft_engineering/scaling.html) for more information.
    +
    +### Step 2. Training
    +
    +Once the original table `purchase_history` has been converted into pairs 
of `features` and `label`, you can build a binary classifier by running the 
following query:
    +
    +```sql
    +create table if not exists classifier as
    +select
    +  train_classifier(
    +    features, -- feature vector
    +    label, -- target value
    +    '-loss_function logloss -optimizer SGD -regularization l1' -- hyper-parameters
    +  ) as (feature, weight)
    +from
    +  training
    +;
    +```
    +
    +The above query builds a binary classifier with:
    +
    +- `-loss_function logloss`
    +  - Use logistic loss, i.e., logistic regression
    +- `-optimizer SGD`
    +  - Learn model parameters with the SGD optimization
    +- `-regularization l1`
    +  - Apply L1 regularization
    +
    +Eventually, the output table `classifier` stores model parameters as:
    +
    +| feature | weight |
    +|:---:|:---:|
    +| day of week#Wednesday | 0.7443372011184692 |
    +| day of week#Thursday | 1.415687620465178e-07 |
    +| day of week#Friday | -0.2697019577026367 |
    +| day of week#Saturday | 0.7337419390678406 |
    +| category#book | 0.7337419390678406 |
    +| category#electronics | 0.7443372011184692 |
    +| category#entertainment | 5.039264578954317e-07 |
    +| category#food | 1.415687620465178e-07 |
    +| category#sports | -0.2697771489620209 |
    +| gender#male | 0.7336684465408325 |
    +| gender#female | 0.47442761063575745 |
    +| gender#other | 5.039264578954317e-07 |
    +| price | -110.62307739257812 |
    +
    +Notice that a weight is learned for each possible value of a categorical feature, and for every single quantitative feature.
    +
    +Of course, you can optimize hyper-parameters to build a more accurate prediction model. Check the output of the following query to see all available options, including learning rate, number of iterations and regularization parameters, and their default values:
    +
    +```sql
    +select train_classifier(array(), 0, '-help');
    +```
    +
    +### Step 3. Prediction
    +
    +Now, the table `classifier` has linear coefficients for given features, and we can predict unforeseen samples by computing a weighted sum of their features.
    +
    +How about the probability of purchase by a `male` customer who sees a 
`food` product priced at `120` on `Friday`? Which product is more likely to be 
purchased by the customer on `Friday`?
    +
    +To differentiate potential purchases, create an `unforeseen_samples` table with these unknown combinations of features:
    +
    +```sql
    +create table if not exists unforeseen_samples as
    +select 1 as id, array("gender#male", "category#food", "day of week#Friday", "price:120") as features
    +union all
    +select 2 as id, array("gender#male", "category#sports", "day of week#Friday", "price:1000") as features
    +union all
    +select 3 as id, array("gender#male", "category#electronics", "day of week#Friday", "price:540") as features
    +;
    +```
    +
    +Predictions for the feature vectors can be made by a join operation between `unforeseen_samples` and `classifier` on each feature:
    +
    +```sql
    +with features_exploded as (
    +  select
    +    id,
    +    -- split feature string into its name and value
    +    -- to join with a model table
    +    extract_feature(fv) as feature,
    +    extract_weight(fv) as value
    +  from unforeseen_samples t1 LATERAL VIEW explode(features) t2 as fv
    +)
    +select
    +  t1.id,
    +  sigmoid( sum(p1.weight * t1.value) ) as probability
    +from
    +  features_exploded t1
    +  LEFT OUTER JOIN classifier p1 ON (t1.feature = p1.feature)
    +group by
    +  t1.id
    +;
    +```
    +
    +The output for a single sample can be:
    +
    +|id| probability|
    +|---:|---:|
    +| 1| 1.0261879540562902e-10|
    +
    +### Evaluation
    +
    +If you have test samples for evaluation, use Hivemall's [evaluation 
UDFs](../eval/binary_classification_measures.html) to measure the accuracy of 
prediction.
    +
    +For instance, prediction accuracy over the `training` samples can be 
measured as:
    +
    +```sql
    +with features_exploded as (
    +  select
    +    id,
    +    extract_feature(fv) as feature,
    +    extract_weight(fv) as value
    +  from training t1 LATERAL VIEW explode(features) t2 as fv
    +),
    +predictions as (
    +  select
    +    t1.id,
    +    sigmoid( sum(p1.weight * t1.value) ) as probability
    --- End diff --
    
    Agreed. I will add a note on sigmoid at [the first appearance](https://github.com/apache/incubator-hivemall/pull/158/files#diff-00bf5d6dc412242452a06beca32c2f08R241) in this doc.
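
For readers following the thread, here is a minimal Python sketch of what the `sigmoid( sum(p1.weight * t1.value) )` expression in the quoted queries computes. It is an illustration only, not Hivemall code: the helper names `parse_feature` and `predict` are hypothetical, the three weights are copied from the `classifier` table in the diff, and it assumes that a categorical feature such as `gender#male` contributes a value of 1.0. As a simplification, a feature missing from the model contributes 0 to the sum.

```python
import math


def sigmoid(z):
    """Logistic function, written so that large |z| does not overflow."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def parse_feature(fv):
    """Split a feature string into (name, value).

    "price:120"   -> ("price", 120.0)
    "gender#male" -> ("gender#male", 1.0)  # assumed value for categorical features
    """
    name, sep, value = fv.rpartition(":")
    if sep:
        return name, float(value)
    return fv, 1.0


def predict(features, model):
    """Weighted sum of feature values passed through the sigmoid."""
    z = 0.0
    for fv in features:
        name, value = parse_feature(fv)
        # Simplification: an unknown feature adds nothing to the sum.
        z += model.get(name, 0.0) * value
    return sigmoid(z)


# A few weights copied from the `classifier` table in the tutorial.
model = {
    "gender#male": 0.7336684465408325,
    "category#book": 0.7337419390678406,
    "day of week#Saturday": 0.7337419390678406,
}

probability = predict(
    ["gender#male", "category#book", "day of week#Saturday"], model
)
```

The sigmoid maps the unbounded weighted sum onto the interval (0, 1), which is why the result can be read as a probability for a model trained with logistic loss.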
