This is an automated email from the ASF dual-hosted git repository.
cdionysio pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/systemds.git
The following commit(s) were added to refs/heads/main by this push:
new 7fe9f25623 [SYSTEMDS-3169] Make a tutorial
7fe9f25623 is described below
commit 7fe9f2562368f3c70fb8b9954a25e56e69eeb11f
Author: dingzz1811 <[email protected]>
AuthorDate: Wed Jan 28 10:15:41 2026 +0100
[SYSTEMDS-3169] Make a tutorial
This patch adds a tutorial for the SystemDS Python API. In the tutorial,
different kinds of movie recommender systems (Cosine Similarity, ALS,
Linear Regression) are built using different functionalities of the
Python API for SystemDS. The full tutorial Python code is added, as well
as the .rst file for the tutorial in the docs. Furthermore, the tutorial
is integrated into the toctree (Table of Contents Tree).
Co-authored-by: jasminthiele <[email protected]>
---
.../python/docs/source/guide/movie_recommender.rst | 492 +++++++++++++++++
src/main/python/docs/source/index.rst | 1 +
.../examples/tutorials/movie_recommender_system.py | 580 +++++++++++++++++++++
3 files changed, 1073 insertions(+)
diff --git a/src/main/python/docs/source/guide/movie_recommender.rst b/src/main/python/docs/source/guide/movie_recommender.rst
new file mode 100644
index 0000000000..ed74b5c260
--- /dev/null
+++ b/src/main/python/docs/source/guide/movie_recommender.rst
@@ -0,0 +1,492 @@
+.. -------------------------------------------------------------
+..
+.. Licensed to the Apache Software Foundation (ASF) under one
+.. or more contributor license agreements. See the NOTICE file
+.. distributed with this work for additional information
+.. regarding copyright ownership. The ASF licenses this file
+.. to you under the Apache License, Version 2.0 (the
+.. "License"); you may not use this file except in compliance
+.. with the License. You may obtain a copy of the License at
+..
+.. http://www.apache.org/licenses/LICENSE-2.0
+..
+.. Unless required by applicable law or agreed to in writing,
+.. software distributed under the License is distributed on an
+.. "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+.. KIND, either express or implied. See the License for the
+.. specific language governing permissions and limitations
+.. under the License.
+..
+.. ------------------------------------------------------------
+
+
+Building a Movie Recommender System
+===================================
+
+Have you ever wondered how Netflix, Disney+, and other streaming
+platforms know exactly which movies or TV shows to recommend to you? In
+this tutorial, we will explore how recommender systems work and show how
+to implement them using SystemDS, as well as NumPy and PyTorch for
+comparison. The goal of this tutorial is to showcase different features
+of the SystemDS framework that can be accessed with the Python API.
+
+Specifically, we will explore the implementation of a recommender
+system using three distinct mathematical and machine learning
+approaches:
+
+- **Cosine Similarity**: A geometric approach to measure the similarity
+ between users or items based on the angle between their preference
+ vectors.
+
+- **Matrix Factorization**: A technique often used in latent factor
+ models (like ALS) to decompose the user-item interaction matrix into
+ lower-dimensional representations.
+
+- **Linear Regression**: A supervised learning approach used to predict
+ specific ratings by modeling the relationship between user/item
+ features and the target rating.
+
+This tutorial shows only snippets of the code; the complete code can be
+found `here
+<https://github.com/apache/systemds/tree/main/src/main/python/systemds/examples/tutorials/movie_recommender_system.py>`__.
+
+To start with the tutorial, you first have to install SystemDS:
+:doc:`/getting_started/install`.
+
+
+Dataset
+~~~~~~~
+
+As a dataset, we chose the `MovieLens 100K
+Dataset <https://grouplens.org/datasets/movielens/100k/>`__. It consists
+of 100,000 movie ratings from 943 different users on 1682 movies from
+the late 1990s. In the following, we will often refer to movies as
+items. The data is stored in different files:
+
+- **u.data** (contains user_id, item_id (movie), rating and timestamp),
+
+- **u.user** (contains user_id, age, gender, occupation and zip code),
+
+- **u.item** (contains item_id, movie name, release date, IMDb link, and
+  genres in a one-hot format).
+
+Preprocessing for Cosine Similarity and Matrix Factorization
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+To prepare our data for Cosine Similarity and Matrix Factorization, we
+must convert the raw ratings into a User-Item Interaction Matrix. In
+this structure, each row represents a unique user, and each column
+represents a specific movie. The intersection of a row and column
+contains the user’s rating for that movie.
+
+Because the average user has only seen a small percentage of the
+thousands of movies available, this matrix is extremely sparse. Most
+cells will contain missing values (NaN), which we must handle
+differently depending on the algorithm we choose.
+
+First, we load the MovieLens 100k dataset:
+
+.. code:: python
+
+ # Load the MovieLens 100k dataset
+ header = ['user_id', 'item_id', 'rating', 'timestamp']
+   ratings_df = pd.read_csv('movie_data/ml-100k/u.data', sep='\t',
+                            names=header)
+
+We then use the Pandas ``.pivot()`` function to transform the data. This
+gives us the User-Item table.
+
+.. code:: python
+
+   pivot_df = ratings_df.pivot(index='user_id', columns='item_id',
+                               values='rating')
+
+The resulting matrix provides a high-level view of our dataset’s
+interaction patterns:
+
+====== ====== ====== ======
+User Item 1 Item 2 Item 3
+====== ====== ====== ======
+user 1 5 3
+user 2 4 2
+user 3 1 5
+====== ====== ====== ======
+
+Cosine Similarity
+~~~~~~~~~~~~~~~~~
+
+Collaborative Filtering is an umbrella term for algorithms that generate
+recommendations by identifying patterns in user ratings. One of the most
+common techniques is User-Based Collaborative Filtering, where we
+calculate the similarity between users based on their rating history. To
+do this, we treat each user as a vector in a high-dimensional space and
+measure the “distance” between them using `Cosine Similarity
+<https://towardsdatascience.com/cosine-similarity-how-does-it-measure-the-similarity-maths-behind-and-usage-in-python-50ad30aad7db/>`__.
+
+To calculate the cosine similarity between all users (rows) in a matrix
+:math:`X`, we normalize this matrix and then multiply it with its
+transpose.
+
+If :math:`\hat{X}` is the row-normalized version of :math:`X` such that
+each row :math:`i` is defined as
+:math:`\hat{x}_i = \frac{x_i}{\|x_i\|}`, then the entire Cosine
+Similarity matrix :math:`S` is calculated via the Gramian matrix:
+
+.. math:: S = \hat{X}\hat{X}^T
+
+Using NumPy, we perform these operations with vectorization for
+efficiency:
+
+.. code:: python
+
+ # L2 norm of each user vector
+ norms = np.linalg.norm(X, axis=1, keepdims=True)
+
+ # Normalize user vectors
+ X_norm = X / norms
+
+ # Cosine similarity = dot product of normalized vectors
+ user_similarity = X_norm @ X_norm.T
+
+In SystemDS, we follow a similar logic. First, we import and initialize
+the SystemDSContext.
+
+.. code:: python
+
+ from systemds.context import SystemDSContext
+
+ with SystemDSContext() as sds:
+
+Inside this context, we load the data into SystemDS and perform our
+calculations using simple matrix
+functions: :doc:`/api/operator/node/matrix`.
+
+.. code:: python
+
+ # Load into SystemDS
+ X = sds.from_numpy(X_np)
+
+ # Compute L2 norms
+ row_sums = (X * X).sum(axis=1)
+ norms = row_sums.sqrt()
+
+ # Normalize user vectors
+ X_norm = X / norms
+
+ # Cosine similarity = dot product of normalized vectors
+ user_similarity = X_norm @ X_norm.t()
+
+In SystemDS, the line ``user_similarity = X_norm @ X_norm.t()`` does
+not execute any math. Instead, it creates an execution plan. The actual
+computation only occurs when we call ``.compute()``, allowing SystemDS
+to optimize the entire operation.
+
+.. code:: python
+
+ user_similarity = user_similarity.compute()
+
+In both cases, ``user_similarity`` gives us a symmetric matrix (with
+ones on the diagonal) that shows the similarity for every user-user pair.
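+
+As a quick, hypothetical illustration (not part of the tutorial script),
+this matrix can be used directly for neighborhood-based recommendations,
+e.g., to find the most similar users with NumPy:
+
+.. code:: python
+
+   # Illustrative sketch: find the 5 users most similar to user 0
+   scores = user_similarity[0].copy()
+
+   # Exclude the user themselves (self-similarity is 1)
+   scores[0] = -np.inf
+
+   # Indices of the top-5 neighbors, highest similarity first
+   top_neighbors = np.argsort(scores)[::-1][:5]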
+
+While both methods produce the same results, SystemDS takes noticeably
+longer for this specific dataset.
+
+=============== ===== ========
+Method NumPy SystemDS
+=============== ===== ========
+Time in seconds 0.02 0.47
+=============== ===== ========
+
+Matrix Factorization
+~~~~~~~~~~~~~~~~~~~~
+
+Another powerful method for generating movie recommendations is Matrix
+Factorization. Instead of looking at surface-level data, this technique
+uncovers latent factors, the hidden patterns that represent a user’s
+specific tastes (like a preference for 90s rom-coms) and a movie’s
+unique characteristics (like its level of whimsy).
+
+In a real-world scenario, our user-item interaction matrix :math:`R`
+is incredibly sparse because most users have only rated a tiny fraction
+of the available movies. Matrix factorization solves this by decomposing
+:math:`R` into two much smaller, lower-dimensional matrices:
+
+- :math:`P`: Representing user preferences.
+- :math:`Q`: Representing item characteristics.
+
+By multiplying these two matrices back together, we can estimate the
+missing values in our original matrix:
+
+.. math:: R \approx P \cdot Q^T
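+
+As a hypothetical sanity check of the dimensions involved (assuming the
+MovieLens sizes and a rank of 20), the decomposition looks like this:
+
+.. code:: python
+
+   num_users, num_items, rank = 943, 1682, 20
+
+   # One latent vector per user and per item
+   P = np.zeros((num_users, rank))   # shape (943, 20)
+   Q = np.zeros((num_items, rank))   # shape (1682, 20)
+
+   # The reconstruction has the shape of the original ratings matrix
+   R_hat = P @ Q.T                   # shape (943, 1682)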
+
+To find :math:`P` and :math:`Q`, we use the optimization algorithm called
+`Alternating Least Squares (ALS)
+<https://www.shaped.ai/blog/matrix-factorization-the-bedrock-of-collaborative-filtering-recommendations>`__.
+
+In NumPy, we manually iterate through users and items, solving a
+least-squares problem for each. This gives us full control but can be
+computationally expensive as the dataset grows. We can compute
+:math:`\hat{R} = P \cdot Q^T` like this
+`(cf. CodeSignal)
+<https://codesignal.com/learn/courses/diving-deep-into-collaborative-filtering-techniques-with-als/lessons/implementing-the-alternating-least-squares-algorithm>`__:
+
+.. code:: python
+
+ # Random initialization of user and item factors
+ P = np.random.rand(num_users, rank) * 0.01
+ Q = np.random.rand(num_items, rank) * 0.01
+
+ for iteration in range(maxi):
+
+ # Update user factors
+ for u in range(num_users):
+
+ # Get only items user 'u' actually rated
+ user_mask = mask[u, :]
+ Q_u = Q[user_mask, :]
+ R_u = R[u, user_mask]
+
+ if Q_u.shape[0] > 0:
+            P[u, :] = np.linalg.solve(
+                np.dot(Q_u.T, Q_u) + reg * np.eye(rank), np.dot(Q_u.T, R_u)
+            )
+
+ # Update item factors
+ for i in range(num_items):
+
+ # Get only users who actually rated item 'i'
+ item_mask = mask[:, i]
+ P_i = P[item_mask, :]
+ R_i = R[item_mask, i]
+
+ if P_i.shape[0] > 0:
+            Q[i, :] = np.linalg.solve(
+                np.dot(P_i.T, P_i) + reg * np.eye(rank), np.dot(P_i.T, R_i)
+            )
+
+ R_hat = P @ Q.T
+
+SystemDS allows us to execute the same logic using high-level
+script-like functions that are internally optimized. It offers a wide
+variety of built-in algorithms :doc:`/api/operator/algorithms`, including ALS.
+First, we import our algorithm.
+
+.. code:: python
+
+ from systemds.operator.algorithm import als
+
+Then, we initialize the SystemDS context:
+
+.. code:: python
+
+ with SystemDSContext() as sds:
+
+To tune the model for our specific dataset, we configure the following
+hyperparameters:
+
+- ``rank = 20``: The number of latent factors (hidden features) used to
+  describe users and movies. A higher rank allows for more complexity
+  but increases the risk of overfitting.
+- ``reg = 1.0``: The regularization parameter. This prevents the model
+  from becoming too complex by penalizing large weights, helping it
+  generalize better to unseen data.
+- ``maxi = 20``: The maximum number of iterations. ALS is an iterative
+  process, so this parameter bounds its runtime.
+
+Then we run the computation:
+
+.. code:: python
+
+ # Load data into SystemDS
+ R = sds.from_numpy(pivot_df.fillna(0).values)
+
+ # Approximate factorization of R into two matrices P and Q using ALS
+ P, Q = als(R, rank=20, reg=1.0, maxi=20).compute()
+
+ R_hat = P @ Q
+
+To test how well our models generalize to new data, we performed an
+80/20 split, using the first 80,000 ratings for training and the
+remainder for testing. We compared both approaches based on execution
+time and Root Mean Squared Error (RMSE).
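+
+The RMSE itself takes only a line of NumPy; a minimal sketch, assuming
+``predicted`` and ``actual`` are aligned arrays of observed test ratings:
+
+.. code:: python
+
+   # Root Mean Squared Error over the observed test ratings
+   rmse = np.sqrt(np.mean((predicted - actual) ** 2))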
+
+=============== ===== ========
+Method NumPy SystemDS
+=============== ===== ========
+Time in seconds 1.09 2.48
+Train RMSE 0.67 0.67
+Test RMSE 1.03 1.01
+=============== ===== ========
+
+Both implementations are mathematically consistent. SystemDS achieved a
+slightly better Test RMSE.
+
+Linear Regression
+~~~~~~~~~~~~~~~~~
+
+Unlike Matrix Factorization, which relies purely on interaction
+patterns, Linear Regression allows us to incorporate “side information”
+about users and items. By using features like user demographics and
+movie genres, we can build a Content-Based Filtering model that predicts
+ratings based on specific attributes.
+
+Preprocessing
+^^^^^^^^^^^^^
+
+For Linear Regression and Neural Networks, our data must be strictly
+numerical and properly scaled. We begin by loading the MovieLens
+datasets:
+
+.. code:: python
+
+   ratings_df = pd.read_csv('movie_data/ml-100k/u.data', sep='\t',
+                            names=['user_id', 'item_id', 'rating', 'timestamp'])
+   user_df = pd.read_csv('movie_data/ml-100k/u.user', sep='|',
+                         names=['user_id', 'age', 'gender', 'occupation', 'zip_code'])
+   item_df = pd.read_csv('movie_data/ml-100k/u.item', sep='|', names=[
+       'item_id', 'title', 'release_date', 'video_release_date', 'IMDb_URL',
+       'unknown', 'Action', 'Adventure', 'Animation', "Children's", 'Comedy',
+       'Crime', 'Documentary', 'Drama', 'Fantasy', 'Film-Noir', 'Horror',
+       'Musical', 'Mystery', 'Romance', 'Sci-Fi', 'Thriller', 'War', 'Western'],
+       encoding='latin-1')
+
+Libraries like NumPy and SystemDS cannot process strings (e.g.,
+“Student” or “Female”). We must transform these into numerical
+representations:
+
+.. code:: python
+
+ # Turn categorical data into numerical
+ user_df['gender'] = user_df['gender'].apply(lambda x: 0 if x == 'F' else 1)
+ user_df = pd.get_dummies(user_df, columns=['occupation'])
+   item_df['release_date'] = pd.to_datetime(item_df['release_date'],
+                                            errors='raise', format='%d-%b-%Y')
+ item_df['release_year'] = item_df['release_date'].dt.year
+
+Features like ``age`` and ``release_year`` have different scales. If
+left unscaled, the model might incorrectly give more “weight” to the
+larger year values. We normalize them to a 0–1 range to ensure equal
+influence.
+
+.. code:: python
+
+ # Normalize data
+   user_df['age'] = (user_df['age'] - user_df['age'].min()) / (
+       user_df['age'].max() - user_df['age'].min()
+   )
+   item_df['release_year'] = (
+       item_df['release_year'] - item_df['release_year'].min()
+   ) / (item_df['release_year'].max() - item_df['release_year'].min())
+
+Finally, we merge these datasets into a single table. Each row
+represents a specific rating, enriched with all available user and movie
+features. After merging, we drop non-numerical columns (like ``title``
+or ``IMDb_URL``), remove rows with NaN-values and split the data into
+Training and Testing sets.
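+
+A condensed sketch of these steps, mirroring the full tutorial script:
+
+.. code:: python
+
+   # Merge ratings with user and item features
+   merged_df = ratings_df.merge(user_df, on='user_id').merge(item_df, on='item_id')
+
+   # Drop non-numerical / unused columns and rows with NaN values
+   merged_df = merged_df.drop(['user_id', 'item_id', 'timestamp', 'zip_code',
+                               'title', 'release_date', 'video_release_date',
+                               'IMDb_URL', 'unknown'], axis=1)
+   merged_df = merged_df.dropna()
+
+   # 80/20 train/test split
+   ratings = merged_df.pop('rating')
+   train_size = int(0.8 * len(ratings))
+   X_train = merged_df[:train_size].to_numpy()
+   y_train = ratings[:train_size].to_numpy()
+   X_test = merged_df[train_size:].to_numpy()
+   y_test = ratings[train_size:].to_numpy()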
+
+Linear Regression with PyTorch
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+PyTorch is a popular deep learning framework that approaches linear
+regression as an iterative optimization problem. We use Gradient Descent
+to minimize the Mean Squared Error (MSE) by repeatedly updating the
+model’s weights based on calculated gradients.
+
+Data must be converted to ``torch.Tensor`` format.
+
+.. code:: python
+
+ X_train_tensor = torch.from_numpy(X_train).float()
+ y_train_tensor = torch.from_numpy(y_train).float().reshape(-1, 1)
+ X_test_tensor = torch.from_numpy(X_test).float()
+ y_test_tensor = torch.from_numpy(y_test).float().reshape(-1, 1)
+
+We define a model class and an optimizer (SGD). The learning rate
+(``lr``) determines the step size for each update.
+
+.. code:: python
+
+ n_features = X_train.shape[1]
+
+ class linearRegression(torch.nn.Module):
+ def __init__(self):
+ super(linearRegression, self).__init__()
+ # input size: n_features, output size: 1
+ self.linear = torch.nn.Linear(n_features, 1)
+
+ def forward(self, x):
+ out = self.linear(x)
+ return out
+
+ lr_model = linearRegression()
+ criterion = torch.nn.MSELoss()
+ optimizer = torch.optim.SGD(lr_model.parameters(), lr = 0.01)
+
+The model iterates through the dataset for a set number of epochs. In
+each iteration, it performs a forward pass, calculates the loss, and
+backpropagates the gradients to update the weights.
+
+.. code:: python
+
+ for epoch in range(1000):
+
+ # Forward pass and loss
+ pred_y = lr_model(X_train_tensor)
+ loss = criterion(pred_y, y_train_tensor)
+
+ # Backward pass and optimization
+ optimizer.zero_grad()
+ loss.backward()
+ optimizer.step()
+
+We use ``.eval()`` and ``torch.no_grad()`` to disable gradient tracking
+during inference:
+
+.. code:: python
+
+ lr_model.eval()
+ with torch.no_grad():
+ y_pred_test = lr_model(X_test_tensor)
+
+Then, we can calculate the RMSE.
+
+.. code:: python
+
+ y_pred = y_pred_test.numpy().flatten()
+ y_true = y_test_tensor.numpy().flatten()
+ rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
+
+Linear Regression with SystemDS
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Following the same pattern as ALS, SystemDS
+provides a highly optimized, built-in algorithm for linear regression.
+This implementation is designed to handle large-scale data by
+automatically deciding between direct solvers and conjugate gradient
+methods based on the data’s characteristics.
+
+First, we import the ``lm`` training algorithm and the ``lmPredict``
+function for inference.
+
+.. code:: python
+
+ from systemds.operator.algorithm import lm, lmPredict
+
+We transfer our NumPy arrays into the SystemDS context.
+
+.. code:: python
+
+ X_ds = sds.from_numpy(X_train)
+ y_ds = sds.from_numpy(y_train)
+ X_test_ds = sds.from_numpy(X_test)
+
+We call the ``lm`` function to train our model.
+
+.. code:: python
+
+ model = lm(X=X_ds, y=y_ds)
+
+To generate predictions for the test set, we use ``lmPredict``. Because
+SystemDS uses Lazy Evaluation, the actual computation is only triggered
+when we call ``.compute()``.
+
+.. code:: python
+
+ predictions = lmPredict(X_test_ds, model).compute()
+
+Finally, we calculate the RMSE to compare the performance against our
+PyTorch implementation.
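+
+A minimal sketch of this last step, assuming ``y_test`` is the NumPy
+array of true test ratings:
+
+.. code:: python
+
+   y_pred = predictions.flatten()
+   rmse = np.sqrt(np.mean((y_pred - y_test.flatten()) ** 2))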
+
+Comparison
+^^^^^^^^^^
+
+=============== ======= ========
+Method PyTorch SystemDS
+=============== ======= ========
+Time in seconds 1.77 0.87
+Test RMSE 1.13 1.08
+=============== ======= ========
+
+Using linear regression, SystemDS ran considerably faster than our
+PyTorch approach and achieved a slightly better test RMSE.
diff --git a/src/main/python/docs/source/index.rst b/src/main/python/docs/source/index.rst
index 09de5494df..98dd939344 100644
--- a/src/main/python/docs/source/index.rst
+++ b/src/main/python/docs/source/index.rst
@@ -54,6 +54,7 @@ tensors (multi-dimensional arrays) whose first dimension may have a heterogeneou
guide/federated.rst
guide/algorithms_basics.rst
guide/python_end_to_end_tut.rst
+ guide/movie_recommender.rst
.. toctree::
:maxdepth: 1
diff --git a/src/main/python/systemds/examples/tutorials/movie_recommender_system.py b/src/main/python/systemds/examples/tutorials/movie_recommender_system.py
new file mode 100644
index 0000000000..66e486949a
--- /dev/null
+++ b/src/main/python/systemds/examples/tutorials/movie_recommender_system.py
@@ -0,0 +1,580 @@
+#!/usr/bin/env python3
+# -------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+# -------------------------------------------------------------
+
+import time
+import pandas as pd
+import numpy as np
+import torch
+import torch.nn as nn
+import torch.optim as optim
+from systemds.context import SystemDSContext
+from systemds.operator.algorithm import als, lm, lmPredict
+
+# To run this code, first download the MovieLens 100k dataset from
+# https://grouplens.org/datasets/movielens/100k/ and extract it to the
+# specified folder.
+
+data_folder = "movie_data/ml-100k/"
+
+
+def read_movie_data(n_rows: int = 10000) -> pd.DataFrame:
+ """
+    Reads the MovieLens 100k dataset and returns a DataFrame with the
+    following columns: user_id, item_id, rating.
+
+ :param n_rows: Number of rows to read from the dataset.
+ :return: DataFrame containing the movie ratings data.
+ """
+
+ # Load the MovieLens 100k dataset
+ header = ["user_id", "item_id", "rating", "timestamp"]
+ ratings_df = pd.read_csv(data_folder + "u.data", sep="\t", names=header)
+
+ # Drop timestamp column
+ ratings_df = ratings_df.drop("timestamp", axis=1)
+
+ # Only check first n_rows rows to speed up processing
+ ratings_df = ratings_df.head(n_rows)
+
+ return ratings_df
+
+
+def create_pivot_table(ratings_df: pd.DataFrame) -> pd.DataFrame:
+ """
+    Creates a pivot table from the ratings DataFrame where rows are users,
+    columns are items, and values are ratings.
+
+    :param ratings_df: DataFrame containing the movie ratings data with
+        columns user_id, item_id, rating.
+    :return: Pivot table with users as rows, items as columns, and ratings
+        as values.
+ """
+
+    return ratings_df.pivot(index="user_id", columns="item_id", values="rating")
+
+
+### Cosine Similarity Functions ###
+
+
+def numpy_cosine_similarity(pivot_df: pd.DataFrame) -> tuple[pd.DataFrame, float]:
+ """
+ Calculates the cosine similarity between users using NumPy.
+
+ :param pivot_df: DataFrame containing the pivot table of user-item ratings.
+    :return: DataFrame containing the cosine similarity between users and
+        time taken.
+ """
+
+ # zeros = unrated items
+ X = pivot_df.fillna(0).values
+
+ start = time.time()
+
+ # L2 norm of each user vector
+ norms = np.linalg.norm(X, axis=1, keepdims=True)
+
+ # Normalize user vectors
+ X_norm = X / norms
+
+ # Cosine similarity = dot product of normalized vectors
+ user_similarity = X_norm @ X_norm.T
+
+ end = time.time()
+
+ # convert to DataFrame for better readability
+ user_similarity_df = pd.DataFrame(
+ user_similarity, index=pivot_df.index, columns=pivot_df.index
+ )
+
+ return user_similarity_df, end - start
+
+
+def systemds_cosine_similarity(pivot_df: pd.DataFrame) -> tuple[pd.DataFrame, float]:
+ """
+ Calculates the cosine similarity between users using SystemDS.
+
+ :param pivot_df: DataFrame containing the pivot table of user-item ratings.
+    :return: DataFrame containing the cosine similarity between users and
+        time taken.
+ """
+
+ # Zeros = unrated items
+ X_np = pivot_df.fillna(0).values
+
+ with SystemDSContext() as sds:
+
+ start = time.time()
+
+ # Load into SystemDS
+ X = sds.from_numpy(X_np)
+
+ # Compute L2 norms
+ row_sums = (X * X).sum(axis=1)
+ norms = row_sums.sqrt()
+
+ # Normalize user vectors
+ X_norm = X / norms
+
+ # Cosine similarity = dot product of normalized vectors
+ user_similarity = X_norm @ X_norm.t()
+
+ # Compute result
+ user_similarity = user_similarity.compute()
+
+ end = time.time()
+
+ # Convert to DataFrame for better readability
+ user_similarity_df = pd.DataFrame(
+ user_similarity, index=pivot_df.index, columns=pivot_df.index
+ )
+
+ return user_similarity_df, end - start
+
+
+def evaluate_cosine_similarity() -> None:
+ """
+    Evaluates and compares the cosine similarity computations between
+    NumPy and SystemDS.
+ """
+
+ ratings = read_movie_data(100000)
+ pivot_df = create_pivot_table(ratings)
+
+ numpy_df, numpy_time = numpy_cosine_similarity(pivot_df)
+ systemds_df, systemds_time = systemds_cosine_similarity(pivot_df)
+
+ # Check if the results are approximately equal
+ if np.allclose(numpy_df.values, systemds_df.values, atol=1e-8):
+ print("Cosine similarity DataFrames are approximately equal.")
+ else:
+ print("Cosine similarity DataFrames are NOT equal.")
+
+ print(f"Time taken for NumPy cosine similarity: {numpy_time}")
+ print(f"Time taken for SystemDS cosine similarity: {systemds_time}")
+
+
+### Matrix Factorization Functions ###
+
+
+def numpy_als(
+ pivot_df: pd.DataFrame, rank: int, reg: float, maxi: int
+) -> tuple[pd.DataFrame, float]:
+ """
+    Calculates a matrix R_hat using Alternating Least Squares (ALS) matrix
+    factorization in NumPy.
+
+    :param pivot_df: DataFrame containing the pivot table of user-item ratings.
+    :param rank: Rank (number of latent factors) of the factorization.
+    :param reg: Regularization parameter.
+    :param maxi: Maximum number of iterations.
+    :return: DataFrame containing the predicted ratings and time taken.
+ """
+
+ # Fill NaNs with zeros for computation
+ R = pivot_df.fillna(0).values
+
+ start = time.time()
+ num_users, num_items = R.shape
+ mask = R != 0
+
+ # Random initialization of user and item factors
+ P = np.random.rand(num_users, rank) * 0.01
+ Q = np.random.rand(num_items, rank) * 0.01
+
+ for iteration in range(maxi):
+
+ # Update user factors
+ for u in range(num_users):
+
+ # Get only items user 'u' actually rated
+ user_mask = mask[u, :]
+ Q_u = Q[user_mask, :]
+ R_u = R[u, user_mask]
+
+ if Q_u.shape[0] > 0:
+ P[u, :] = np.linalg.solve(
+ np.dot(Q_u.T, Q_u) + reg * np.eye(rank), np.dot(Q_u.T, R_u)
+ )
+
+ # Update item factors
+ for i in range(num_items):
+
+ # Get only users who actually rated item 'i'
+ item_mask = mask[:, i]
+ P_i = P[item_mask, :]
+ R_i = R[item_mask, i]
+
+ if P_i.shape[0] > 0:
+ Q[i, :] = np.linalg.solve(
+ np.dot(P_i.T, P_i) + reg * np.eye(rank), np.dot(P_i.T, R_i)
+ )
+
+ end = time.time()
+
+ # Multiply P and Q to get the approximated ratings matrix
+ R_hat = P @ Q.T
+
+ # Convert to DataFrame for better readability
+    ratings_hat_df = pd.DataFrame(
+        R_hat, index=pivot_df.index, columns=pivot_df.columns
+    )
+
+ return ratings_hat_df, end - start
+
+
+def systemds_als(
+ pivot_df: pd.DataFrame, rank: int, reg: float, maxi: int
+) -> tuple[pd.DataFrame, float]:
+ """
+    Calculates a matrix R_hat using Alternating Least Squares (ALS) matrix
+    factorization in SystemDS.
+
+    :param pivot_df: DataFrame containing the pivot table of user-item ratings.
+    :param rank: Rank (number of latent factors) of the factorization.
+    :param reg: Regularization parameter.
+    :param maxi: Maximum number of iterations.
+    :return: DataFrame containing the predicted ratings and time taken.
+ """
+
+ start = time.time()
+
+ with SystemDSContext() as sds:
+
+ # Load data into SystemDS
+ R = sds.from_numpy(pivot_df.fillna(0).values)
+
+ # Approximate factorization of R into two matrices P and Q using ALS
+ P, Q = als(R, rank=rank, reg=reg, maxi=maxi).compute()
+ end = time.time()
+
+ # Multiply P and Q to get the approximated ratings matrix
+ R_hat = P @ Q
+
+ # Convert to DataFrame for better readability
+    ratings_hat_df = pd.DataFrame(
+        R_hat, index=pivot_df.index, columns=pivot_df.columns
+    )
+
+ return ratings_hat_df, end - start
+
+
+def evaluate_als(
+ model: str = "systemds", rank: int = 10, reg: float = 1.0, maxi: int = 20
+) -> None:
+ """
+    Evaluates and compares the ALS computations between NumPy and SystemDS.
+    The data is split into training and test sets with an 80/20 ratio. Then
+    the RMSE is calculated for both sets.
+
+ :param model: Model to use for ALS computation ("systemds" or "numpy").
+ :param rank: Rank of the factorized matrices.
+ :param reg: Regularization parameter.
+ :param maxi: Maximum number of iterations.
+ """
+
+ ratings = read_movie_data(100000)
+ pivot_df = create_pivot_table(ratings[:80000])
+
+    if model == "systemds":
+        ratings_hat_df, als_time = systemds_als(pivot_df, rank, reg, maxi)
+    else:
+        ratings_hat_df, als_time = numpy_als(pivot_df, rank, reg, maxi)
+
+    # Print time taken
+    print(f"Time taken for {model} ALS: {als_time}")
+
+ # Training error
+ mask = ~np.isnan(pivot_df.values)
+ train_rmse = np.sqrt(
+ np.mean((ratings_hat_df.values[mask] - pivot_df.values[mask]) ** 2)
+ )
+ print(f"Train RMSE for model with {model}: {train_rmse}")
+
+ # Test error
+ test_set = ratings[80000:]
+ stacked_series = ratings_hat_df.stack()
+ ratings_hat_long = stacked_series.reset_index()
+ ratings_hat_long.columns = ["user_id", "item_id", "rating"]
+
+ merged_df = pd.merge(
+ test_set,
+ ratings_hat_long,
+ on=["user_id", "item_id"],
+ how="inner",
+ suffixes=("_actual", "_predicted"),
+ )
+
+ # Force predictions to stay between 0.5 and 5.0
+    merged_df["rating_predicted"] = merged_df["rating_predicted"].clip(0.5, 5.0)
+
+ # Calculate root mean squared error (RMSE)
+    squared_errors = (
+        merged_df["rating_actual"] - merged_df["rating_predicted"]
+    ) ** 2
+ mse = np.mean(squared_errors)
+ test_rmse = np.sqrt(mse)
+
+ print(f"Test RMSE for model with {model}: {test_rmse}")
+
+
+### Linear Regression ###
+
+
+def preprocess_data() -> tuple[np.ndarray, np.ndarray, np.ndarray, np.ndarray]:
+ """
+    This function reads and preprocesses the MovieLens 100k dataset for
+    linear regression. It returns four different numpy arrays: X_train,
+    y_train, X_test, y_test. The preprocessing steps include:
+ - Reading the datasets
+ - Handling categorical variables
+ - Normalizing numerical features
+ - Merging datasets
+ - Dropping unnecessary columns
+ - Dropping rows with NaN values
+ - Splitting into training and testing sets.
+
+ :return: tuple of numpy arrays (X_train, y_train, X_test, y_test)
+ """
+
+ # Read datasets
+ ratings_df = pd.read_csv(
+ data_folder + "u.data",
+ sep="\t",
+ names=["user_id", "item_id", "rating", "timestamp"],
+ )
+ user_df = pd.read_csv(
+ data_folder + "u.user",
+ sep="|",
+ names=["user_id", "age", "gender", "occupation", "zip_code"],
+ )
+ item_df = pd.read_csv(
+ data_folder + "u.item",
+ sep="|",
+ names=[
+ "item_id",
+ "title",
+ "release_date",
+ "video_release_date",
+ "IMDb_URL",
+ "unknown",
+ "Action",
+ "Adventure",
+ "Animation",
+ "Children's",
+ "Comedy",
+ "Crime",
+ "Documentary",
+ "Drama",
+ "Fantasy",
+ "Film-Noir",
+ "Horror",
+ "Musical",
+ "Mystery",
+ "Romance",
+ "Sci-Fi",
+ "Thriller",
+ "War",
+ "Western",
+ ],
+ encoding="latin-1",
+ )
+
+ # Turn categorical data into numerical
+ user_df["gender"] = user_df["gender"].apply(lambda x: 0 if x == "F" else 1)
+ user_df = pd.get_dummies(user_df, columns=["occupation"])
+ item_df["release_date"] = pd.to_datetime(
+ item_df["release_date"], errors="raise", format="%d-%b-%Y"
+ )
+ item_df["release_year"] = item_df["release_date"].dt.year
+
+ # Normalize data
+ user_df["age"] = (user_df["age"] - user_df["age"].min()) / (
+ user_df["age"].max() - user_df["age"].min()
+ )
+ item_df["release_year"] = (
+ item_df["release_year"] - item_df["release_year"].min()
+ ) / (item_df["release_year"].max() - item_df["release_year"].min())
+
+ # Merge datasets
+    merged_df = ratings_df.merge(user_df, on="user_id").merge(item_df, on="item_id")
+
+ # Drop unnecessary columns
+ merged_df = merged_df.drop(
+ [
+ "user_id",
+ "item_id",
+ "timestamp",
+ "zip_code",
+ "title",
+ "release_date",
+ "video_release_date",
+ "IMDb_URL",
+ "unknown",
+ ],
+ axis=1,
+ )
+
+ # Convert boolean columns to integers (important for NumPy and SystemDS)
+ bool_cols = merged_df.select_dtypes(include=["bool"]).columns
+ merged_df[bool_cols] = merged_df[bool_cols].astype(int)
+
+ # Drop rows with NaN values
+ merged_df = merged_df.dropna()
+
+ ratings = merged_df.pop("rating")
+ features = merged_df
+
+ # Split into train and test sets and convert to numpy arrays
+ train_size = int(0.8 * len(ratings))
+ X_train = features[:train_size].to_numpy()
+ y_train = ratings[:train_size].to_numpy()
+ X_test = features[train_size:].to_numpy()
+ y_test = ratings[train_size:].to_numpy()
+
+ print("NaNs in X:", np.isnan(X_train).any())
+ print("NaNs in y:", np.isnan(y_train).any())
+
+ return X_train, y_train, X_test, y_test
+
+
+def linear_regression_pytorch(
+ X_train, y_train, X_test, y_test, num_epochs=1000
+) -> tuple[float, float]:
+ """
+ Trains a linear regression model using PyTorch.
+
+ :param X_train, X_test: numpy arrays of shape (n_samples, n_features)
+ :param y_train, y_test: numpy arrays of shape (n_samples,)
+ :param num_epochs: number of training iterations
+
+ :return rmse: RMSE on test set
+ :return time taken: time in seconds for training and prediction
+ """
+
+ start = time.time()
+
+ # Convert to PyTorch tensors
+ X_train_tensor = torch.from_numpy(X_train).float()
+ y_train_tensor = torch.from_numpy(y_train).float().reshape(-1, 1)
+ X_test_tensor = torch.from_numpy(X_test).float()
+ y_test_tensor = torch.from_numpy(y_test).float().reshape(-1, 1)
+
+ # Define model
+ n_features = X_train.shape[1]
+
+ class linearRegression(torch.nn.Module):
+ def __init__(self):
+ super(linearRegression, self).__init__()
+ # input size: n_features, output size: 1
+ self.linear = torch.nn.Linear(n_features, 1)
+
+ def forward(self, x):
+ out = self.linear(x)
+ return out
+
+ lr_model = linearRegression()
+
+ # Loss and optimizer
+ criterion = torch.nn.MSELoss()
+ optimizer = torch.optim.SGD(lr_model.parameters(), lr=0.01)
+
+ # Training loop
+ for epoch in range(num_epochs):
+
+ # Forward pass and loss
+ pred_y = lr_model(X_train_tensor)
+ loss = criterion(pred_y, y_train_tensor)
+
+ # Backward pass and optimization
+ optimizer.zero_grad()
+ loss.backward()
+ optimizer.step()
+
+ if (epoch + 1) % 100 == 0:
+ print(f"Epoch [{epoch+1}/{num_epochs}], Loss: {loss.item():.4f}")
+
+ # Make predictions on test set
+ lr_model.eval()
+ with torch.no_grad():
+ y_pred_test = lr_model(X_test_tensor)
+
+ end = time.time()
+
+ y_pred = y_pred_test.numpy().flatten()
+ y_true = y_test_tensor.numpy().flatten()
+
+ # Compute RMSE
+ rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
+
+ return rmse, end - start
+
+
+def linear_regression_systemds(
+ X_train, y_train, X_test, y_test, num_epochs=1000
+) -> tuple[float, float]:
+ """
+ Trains a linear regression model using SystemDS.
+
+ :param X_train, X_test: numpy arrays of shape (n_samples, n_features)
+ :param y_train, y_test: numpy arrays of shape (n_samples,)
+ :param num_epochs: maximum number of training iterations
+
+ :return rmse: RMSE on test set
+ :return time taken: time in seconds for training and prediction
+ """
+
+ with SystemDSContext() as sds:
+
+ start = time.time()
+
+ # Read data into SystemDS
+ X_ds = sds.from_numpy(X_train)
+ y_ds = sds.from_numpy(y_train)
+ X_test_ds = sds.from_numpy(X_test)
+
+ # Train linear regression model with max iterations
+ model = lm(X=X_ds, y=y_ds, maxi=num_epochs)
+ # Make predictions on test set
+ predictions = lmPredict(X_test_ds, model).compute()
+
+ end = time.time()
+
+ y_pred = predictions.flatten()
+ y_true = y_test.flatten()
+
+ # Compute RMSE
+ rmse = np.sqrt(np.mean((y_pred - y_true) ** 2))
+
+ return rmse, end - start
+
+
+def evaluate_lr() -> None:
+ """
+    Evaluates and compares the linear regression computations between
+    PyTorch and SystemDS. The data is split into training and test sets
+    with an 80/20 ratio. Then the test RMSE is calculated for each model.
+ """
+
+ print("Evaluating Linear Regression Models...")
+
+ X_train, y_train, X_test, y_test = preprocess_data()
+
+ pytorch_rmse, pytorch_time = linear_regression_pytorch(
+ X_train, y_train, X_test, y_test, num_epochs=1000
+ )
+ systemds_rmse, systemds_time = linear_regression_systemds(
+ X_train, y_train, X_test, y_test, num_epochs=1000
+ )
+
+ print(f"PyTorch RMSE: {pytorch_rmse}, Time: {pytorch_time} seconds")
+ print(f"SystemDS RMSE: {systemds_rmse}, Time: {systemds_time} seconds")
+
+
+if __name__ == "__main__":
+
+ # Cosine Similarity
+ evaluate_cosine_similarity()
+
+ # Matrix Factorization using ALS
+ evaluate_als("systemds")
+ evaluate_als("numpy")
+
+ # Linear Regression
+ evaluate_lr()