Hi MADlib Developers,

I have implemented Steffen Rendle's Factorization Machines
<http://www.csie.ntu.edu.tw/~b97053/paper/Rendle2010FM.pdf> on Python and
MPP. You can find a pure python implementation and it's MPP variant for
Greenplum and HAWQ here:

1. Factorization Machines - Pure Python
<https://github.com/gautamsm/data-science-on-mpp/blob/master/ml_algorithms/factorization_machines.ipynb>

2. Factorization Machines - MPP
<https://github.com/gautamsm/data-science-on-mpp/blob/master/ml_algorithms/factorization_machines_mpp.ipynb>


*A quick summary of factorization machines:*

Typically, when building a recommender system, you not only have tuple
interactions (user-item, user-offer, user- TV show, user-purchase, etc.),
but also side information in the form of user and item features.
Historically, algorithms such as SVM and dense kernel methods have not been
good at estimating parameters for recommenders owing to the sparsity of the
tuple interactions. Bayesian collaborative filtering approaches have had
success, but incorporation of side information entails fairly complex
hierarchical modeling resulting in MCMC or Gibbs sampling for inference of
model parameters.

Rendle's work is an extremely interesting and novel way of thinking about
this problem and seamlessly blending tuple interactions with side
information. Since tuple interactions are modeled via a low rank matrix
decomposition, the quality of parameter estimation will not suffer from
sparsity unlike SVMs. I found the approach really elegant and further
research and few great posts on Quora (one by a leading name
<https://www.quora.com/What-are-some-of-the-best-recommendation-algorithms>,
Xavier Amatrian) convinced me about the merit of this work.


A couple of key points about factorization machines:

1. They can be seen as a generalization of SVMs and SVD++

2. *They can be applied to any real valued feature vector for regression or
classification problems (think in place of linear/logistic regression, SVMs
etc.)*

*A quick note on my MPP implementation:*

Function interface:

fm_fit (

        src_table varchar,

        id_col_name varchar,

        x_col_name varchar,

        y_col_name varchar,

        parameters varchar,

        out_table varchar

)

Parameters of the fm_fit function:

source_table - table containing the feature vectors and dependent variable
as two separate columns

id_col_name - an id column for each row

x_col_name - features

y_col_name - dependent variable

parameters - learning parameters: regularizers (lambda0, lambda1, lambda2),
number of iterations (n_iter), learning rate (learning_rate), factorization
hyper-parameter (k), random seed (integer for batch-sgd), optim (batch
or batch-std), and out_table - name of the table where the model will be
stored

1. The MPP implementation has been currently implemented for the regression
problem (squared loss function). I plan to implement the classification
problem soon (hinge/log loss).

2. The implementation assumes your training data is randomly distributed.
Two gradient descent variants are provided: mini-batch SGD and regular
batch gradient descent. The mini-batch SGD implementation is very closely
based on this paper by Smola et al.
<http://www.research.rutgers.edu/~lihong/pub/Zinkevich11Parallelized.pdf>

3.  Lastly, the order complexity of factorization machines is roughly
O(TKN) where T is number of SGD iterations, K is the number of latent
factors, and N is the feature dimension. The implementation involves a
couple of loops for estimating the parameters of the tuple interactions,
which I couldn't vectorize (I could, but the vectorization meant building
3-D matrices, which was even slower).  Consequently, an implementation of
the SGD in C or C++ (as in MADlib) will be much faster than the PL/Python
approach.

Many thanks to Xiaocheng Teng (MADlib committer) for discussion on parallel
SGD algrotihms.

Does the community feel this would be a good addition to MADlib?

Regards,

Gautam

-- 
------------------------------------------------------------------
Gautam S. Muralidhar, Ph.D.
Sr. Data Scientist, Pivotal Software, Inc.
------------------------------------------------------------------

Reply via email to