(systemds) branch main updated: [SYSTEMDS-3731] Initial cluster-and-classify ensemble primitive

mboehm7 Thu, 29 Aug 2024 04:00:00 -0700

This is an automated email from the ASF dual-hosted git repository.

mboehm7 pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/systemds.git



The following commit(s) were added to refs/heads/main by this push:
     new ff7f57b683 [SYSTEMDS-3731] Initial cluster-and-classify ensemble primitive
ff7f57b683 is described below

commit ff7f57b683a2ae4413150b9fa1ff74e44f9df7bb
Author: Matthias Boehm <mboe...@gmail.com>
AuthorDate: Thu Aug 29 12:28:40 2024 +0200

    [SYSTEMDS-3731] Initial cluster-and-classify ensemble primitive
    
    This patch introduces the first exploration of a new primitive for
    ensemble classification, where we first cluster the dataset and then
    simply train a linear model for every cluster (data points that belong
    to this cluster). On first test datasets (Adult and Covtype), this
    simple strategy of model specialization yields remarkably good results.
    The best test accuracies improved from 85% to 96% (Adult, 7 clusters)
    and from 64% to 93% (Covtype, 12 clusters).
    
    Right now these scripts are still in staging, because a full
    integration should use eval to allow passing any classification
    method, should use train/validation/test splits, and should provide
    separate train/predict builtin functions.
---
 .../clusterAndClassify/clusteredClassification.dml | 99 ++++++++++++++++++++++
 .../staging/clusterAndClassify/results_Adult.out   | 47 ++++++++++
 .../staging/clusterAndClassify/results_Covtype.out | 49 +++++++++++
 3 files changed, 195 insertions(+)
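
The commit message describes the strategy (cluster first, then train one model per cluster) and the planned split into separate train/predict functions with a pluggable classifier. A minimal sketch of that structure follows, in plain Python rather than DML so that it runs standalone. Everything in it is an assumption for illustration: the names train, predict, and fit_majority are hypothetical, the majority-label classifier stands in for multiLogReg, the fallback to a global model for undersized clusters is not part of the patch, and accuracy is computed over all rows rather than as the mean of per-cluster accuracies that the script reports.

```python
from collections import Counter

def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((p - q) ** 2 for p, q in zip(a, b))

def assign(x, centroids):
    """Index of the centroid closest to x."""
    return min(range(len(centroids)), key=lambda i: dist2(x, centroids[i]))

def kmeans(X, k, iters=10):
    """Lloyd's k-means with deterministic farthest-first initialization."""
    centroids = [list(X[0])]
    while len(centroids) < k:
        far = max(X, key=lambda x: min(dist2(x, c) for c in centroids))
        centroids.append(list(far))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for x in X:
            groups[assign(x, centroids)].append(x)
        for i, g in enumerate(groups):
            if g:  # an empty cluster keeps its previous centroid
                centroids[i] = [sum(col) / len(g) for col in zip(*g)]
    return centroids

def fit_majority(X, y):
    """Hypothetical stand-in classifier: always predicts the most frequent label."""
    label = Counter(y).most_common(1)[0][0]
    return lambda x: label

def train(X, y, k, fit, min_rows=1):
    """Cluster X, then fit one model per cluster on that cluster's rows."""
    centroids = kmeans(X, k)
    ids = [assign(x, centroids) for x in X]
    fallback = fit(X, y)  # assumption: global model for undersized clusters
    models = []
    for i in range(k):
        rows = [j for j, c in enumerate(ids) if c == i]  # rows of cluster i
        if len(rows) >= min_rows:
            models.append(fit([X[j] for j in rows], [y[j] for j in rows]))
        else:
            models.append(fallback)
    return centroids, models

def predict(model, X):
    """Route each row to its nearest cluster and apply that cluster's model."""
    centroids, models = model
    return [models[assign(x, centroids)](x) for x in X]

def accuracy(yhat, y):
    """Fraction of correct predictions over all rows."""
    return sum(a == b for a, b in zip(yhat, y)) / len(y)

# Toy data: two well-separated groups, each with its own label.
X = [(0.1 * j, 0.0) for j in range(10)] + [(10.0 + 0.1 * j, 0.0) for j in range(10)]
y = [1] * 10 + [2] * 10

baseline = fit_majority(X, y)
baseline_acc = accuracy([baseline(x) for x in X], y)
clustered_acc = accuracy(predict(train(X, y, 2, fit_majority), X), y)
print(baseline_acc, clustered_acc)
```

On this toy data the single global model is right for half of the rows, while the per-cluster models are right for all of them, which is the specialization effect in its simplest form. Passing fit as an argument plays the role that eval would play in a DML builtin.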

diff --git a/scripts/staging/clusterAndClassify/clusteredClassification.dml b/scripts/staging/clusterAndClassify/clusteredClassification.dml
new file mode 100644
index 0000000000..f54545861f
--- /dev/null
+++ b/scripts/staging/clusterAndClassify/clusteredClassification.dml
@@ -0,0 +1,99 @@
+#-------------------------------------------------------------
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+#
+#-------------------------------------------------------------
+
+
+F = read("./data/Adult.csv", data_type="frame", format="csv", header=FALSE);
+jspec2= "{ ids:true, recode:[15], dummycode:[2,4,6,7,8,9,10,14]}"
+
+/*
+F = read("./data/Covtype.csv", data_type="frame", format="csv", header=FALSE);
+jspec2= "{ ids:true, recode:[11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,"
++"31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54], bin:["
++"{id:1, method:equi-width, numbins:10},"
++"{id:2, method:equi-width, numbins:10},"
++"{id:3, method:equi-width, numbins:10},"
++"{id:4, method:equi-width, numbins:10},"
++"{id:5, method:equi-width, numbins:10},"
++"{id:6, method:equi-width, numbins:10},"
++"{id:7, method:equi-width, numbins:10},"
++"{id:8, method:equi-width, numbins:10},"
++"{id:9, method:equi-width, numbins:10},"
++"{id:10, method:equi-width, numbins:10}]}"
+*/
+
+[X,M] = transformencode(target=F, spec=jspec2);
+y = X[,ncol(X)];
+X = X[,2:(ncol(X)-1)]
+X = scale(X=X)
+
+[Xtrain,Xtest,ytrain,ytest] = split(X=X,Y=y,f=0.7,cont=FALSE,seed=7)
+
+# learn baseline model
+B = multiLogReg(X=Xtrain, Y=ytrain, maxii=50, icpt=2, reg=0.001, verbose=FALSE);
+[M,yhat,acc] = multiLogRegPredict(X=Xtrain, B=B, Y=ytrain, verbose=TRUE);
+[M,yhat,acc] = multiLogRegPredict(X=Xtest, B=B, Y=ytest, verbose=TRUE);
+
+print("Clustered Classification:")
+for(k in 2:16) {
+  print("-- w/ "+k+" clusters")
+
+  # clustering
+  [C,Yk] = kmeans(X=Xtrain, k=k);
+
+  # train a model per cluster and get train accuracy
+  models = list();
+  acctrain = 0;
+  count = 0;
+  for(i in 1:k) {
+    I = (Yk == i); # rows assigned to cluster i
+    Xi = removeEmpty(target=Xtrain, margin="rows", select=I);
+    yi = removeEmpty(target=ytrain, margin="rows", select=I);
+    if( sum(I) > 15 & (max(yi)-min(yi)) > 0 ) {
+      Bi = multiLogReg(X=Xi, Y=yi, maxii=50, icpt=2, reg=0.001, verbose=FALSE);
+      [Mi,yhati,acci] = multiLogRegPredict(X=Xi, B=Bi, Y=yi, verbose=FALSE);
+      acctrain += acci; count = count+1;
+      models = append(models, Bi);
+    }
+    else {
+      models = append(models, as.matrix(0));
+    }
+  }
+  if(count>=1)
+    print("---- train accuracy: "+(acctrain/count))
+
+  # compute test accuracy
+  acctest = 0;
+  count = 0;
+  Yk = kmeansPredict(X=Xtest, C=C)
+  for(i in 1:k) {
+    Bi = as.matrix(models[i])
+    if(nrow(Bi)>1) {
+      I = (Yk == i); # rows assigned to cluster i
+      Xi = removeEmpty(target=Xtest, margin="rows", select=I);
+      yi = removeEmpty(target=ytest, margin="rows", select=I);
+      [Mi,yhati,acci] = multiLogRegPredict(X=Xi, B=Bi, Y=yi, verbose=FALSE);
+      acctest += acci; count = count+1;
+    }
+  }
+  if(count >= 1)
+    print("---- test accuracy: "+(acctest/count))
+}
+
diff --git a/scripts/staging/clusterAndClassify/results_Adult.out b/scripts/staging/clusterAndClassify/results_Adult.out
new file mode 100644
index 0000000000..b51fcb809c
--- /dev/null
+++ b/scripts/staging/clusterAndClassify/results_Adult.out
@@ -0,0 +1,47 @@
+Accuracy (%): 85.07940686145477
+Accuracy (%): 85.04146616156444
+Clustered Classification:
+-- w/ 2 clusters
+---- train accuracy: 74.02917828792106
+---- test accuracy: 73.34669338677354
+-- w/ 3 clusters
+---- train accuracy: 92.63965475905059
+---- test accuracy: 93.21813452248234
+-- w/ 4 clusters
+---- train accuracy: 95.09137769447048
+---- test accuracy: 94.98767460969597
+-- w/ 5 clusters
+---- train accuracy: 92.85159285159286
+---- test accuracy: 90.01814882032669
+-- w/ 6 clusters
+---- train accuracy: 81.9494584837545
+---- test accuracy: 71.77419354838709
+-- w/ 7 clusters
+---- train accuracy: 96.00537092984223
+---- test accuracy: 96.18350038550501
+-- w/ 8 clusters
+---- train accuracy: 100.0
+---- test accuracy: 28.571428571428562
+-- w/ 9 clusters
+---- train accuracy: 98.0735551663748
+---- test accuracy: 95.79439252336451
+-- w/ 10 clusters
+-- w/ 11 clusters
+---- train accuracy: 100.0
+---- test accuracy: 94.73684210526314
+-- w/ 12 clusters
+---- train accuracy: 85.55555555555556
+---- test accuracy: 74.11764705882354
+-- w/ 13 clusters
+---- train accuracy: 100.0
+---- test accuracy: 62.5
+-- w/ 14 clusters
+---- train accuracy: 95.90227010388614
+---- test accuracy: 95.93353738522083
+-- w/ 15 clusters
+---- train accuracy: 71.98719732813802
+---- test accuracy: 71.81996086105677
+-- w/ 16 clusters
+---- train accuracy: 96.97986577181207
+---- test accuracy: 91.40625
+
diff --git a/scripts/staging/clusterAndClassify/results_Covtype.out b/scripts/staging/clusterAndClassify/results_Covtype.out
new file mode 100644
index 0000000000..4ee0db27ea
--- /dev/null
+++ b/scripts/staging/clusterAndClassify/results_Covtype.out
@@ -0,0 +1,49 @@
+Accuracy (%): 64.22693508874097
+Accuracy (%): 64.394596675496
+Clustered Classification:
+-- w/ 2 clusters
+---- train accuracy: 62.47535986828932
+---- test accuracy: 62.797693657558284
+-- w/ 3 clusters
+---- train accuracy: 69.05772239453243
+---- test accuracy: 69.14310536469446
+-- w/ 4 clusters
+---- train accuracy: 76.23711340206185
+---- test accuracy: 78.27569410618607
+-- w/ 5 clusters
+---- train accuracy: 63.71441760270811
+---- test accuracy: 64.05905557489268
+-- w/ 6 clusters
+---- train accuracy: 60.43270151254069
+---- test accuracy: 60.93907006485795
+-- w/ 7 clusters
+---- train accuracy: 63.167114187522365
+---- test accuracy: 63.54799167593148
+-- w/ 8 clusters
+---- train accuracy: 3.562259306803594
+---- test accuracy: 3.42047343125365
+-- w/ 9 clusters
+---- train accuracy: 65.44104732696543
+---- test accuracy: 65.6320090343021
+-- w/ 10 clusters
+---- train accuracy: 78.5183585313175
+---- test accuracy: 78.76357233688739
+-- w/ 11 clusters
+---- train accuracy: 61.45671471181841
+---- test accuracy: 61.409551010122215
+-- w/ 12 clusters
+---- train accuracy: 93.77990430622008
+---- test accuracy: 93.41692789968648
+-- w/ 13 clusters
+---- train accuracy: 78.5183585313175
+---- test accuracy: 78.76357233688739
+-- w/ 14 clusters
+---- train accuracy: 3.4210526315789473
+---- test accuracy: 2.5477707006369426
+-- w/ 15 clusters
+---- train accuracy: 72.8898644391602
+---- test accuracy: 73.31212277172132
+-- w/ 16 clusters
+---- train accuracy: 1.187104401152624
+---- test accuracy: 1.2730165946806091
+
