dkerschbaumer commented on a change in pull request #1334:
URL: https://github.com/apache/systemds/pull/1334#discussion_r671126117
##########
File path: scripts/builtin/xgboost.dml
##########
@@ -0,0 +1,780 @@
+# INPUT PARAMETERS:
+# ---------------------------------------------------------------------------------------------
+# NAME           TYPE            DEFAULT  MEANING
+# ---------------------------------------------------------------------------------------------
+# X              Matrix[Double]  ---      Feature matrix X; note that X needs to be both recoded and dummy coded
+# Y              Matrix[Double]  ---      Label matrix Y; note that Y needs to be both recoded and dummy coded
+# R              Matrix[Double]  1, 1xn   Matrix R; 1xn vector which for each feature in X contains the following information
+#                                           - R[,1]: 1 (scalar feature)
+#                                           - R[,2]: 2 (categorical feature)
+#                                         Feature 1 is a scalar feature and feature 2 is a categorical feature
+#                                         If R is not provided, by default all variables are assumed to be scale (1)
+# sml_type       Integer         1        Supervised machine learning type: 1 = Regression(default), 2 = Classification
+# num_trees      Integer         7        Number of trees to be created in the xgboost model
+# learning_rate  Double          0.3      Alias: eta. After each boosting step the learning rate controls the weights of the new predictions
+# max_depth      Integer         6        Maximum depth of a tree. Increasing this value will make the model more complex and more likely to overfit
+# lambda         Double          0.0      L2 regularization term on weights. Increasing this value will make the model more conservative and reduce the number of leaves of a tree
+# ---------------------------------------------------------------------------------------------
+
+# ---------------------------------------------------------------------------------------------
+# OUTPUT:
+# Matrix M where each column corresponds to a node in the learned tree (the first node is the init prediction) and each row contains the following information:
+#  M[1,j]: id of node j (in a complete binary tree)
+#  M[2,j]: tree id to which node j belongs
+#  M[3,j]: Offset (no. of columns) to left child of j if j is an internal node, otherwise 0
+#  M[4,j]: Feature index of the feature (scale feature id if the feature is scale or categorical feature id if the feature is categorical)
+#    that node j looks at if j is an internal node, otherwise 0
+#  M[5,j]: Type of the feature that node j looks at if j is an internal node. if leaf = 0, if scalar = 1, if categorical = 2
+#  M[6:,j]: If j is an internal node: Threshold the example's feature value is compared to is stored at M[6,j] if the feature chosen for j is scale,
+#    otherwise if the feature chosen for j is categorical rows 6,7,... depict the value subset chosen for j
+#    If j is a leaf node 1 if j is impure and the number of samples at j > threshold, otherwise 0
+# -------------------------------------------------------------------------------------------
+
+m_xgboost = function(Matrix[Double] X, Matrix[Double] y, Matrix[Double] R = matrix(1,rows=1,cols=nrow(X)),
+ Integer sml_type = 1, Integer num_trees = 7, Double learning_rate = 0.3, Integer max_depth = 6, Double lambda = 0.0)
+ return (Matrix[Double] M) {
+ # test if input correct
+ assert(nrow(X) == nrow(y))
+ assert(ncol(y) == 1)
+ assert(nrow(R) == 1)
+
+ M = matrix(0,rows=6,cols=0)
+ # set the init prediction at first col in M
+ init_prediction_matrix = matrix("0 0 0 0 0 0",rows=nrow(M),cols=1)
+ init_prediction_matrix[6,1] = median(y)
+ M = cbind(M, init_prediction_matrix)
Review comment:
Done in https://github.com/apache/systemds/pull/1334/commits/69b2a042249479107f19f1ed82f930ad1fbe81bc
--
This is an automated message from the Apache Git Service.