[GitHub] [systemds] codeyeeter commented on a change in pull request #1323: [WIP][SYSTEMDS-2835] Python end-to-end tutorial

GitBox Sat, 03 Jul 2021 08:55:49 -0700


codeyeeter commented on a change in pull request #1323:
URL: https://github.com/apache/systemds/pull/1323#discussion_r663380211




##########
File path: src/main/python/tests/examples/tutorials/test_adult.py
##########
@@ -386,51 +386,35 @@ def test_level2(self):
 
         """""
         ################################################################################################################
-        X1, M1 = X1.transform_encode(spec=jspec).compute()
+        X1, M1 = X1.transform_encode(spec=jspec)
 
         ################################################################################################################
         """"
-        First we re-split out data into a training and a test set with the corresponding labels. We can then simply transform
-        the numpy array of the training data back to SystemDS matrix by using "sds.from_numpy()".
-        The SystemDS scale function takes a matrix as an input and returns three output parameters:
-            # Y            Matrix    ---      Output feature matrix with K columns
-            # ColMean      Matrix    ---      The column means of the input, subtracted if Center was TRUE
-            # ScaleFactor  Matrix    ---      The Scaling of the values, to make each dimension have similar value ranges
-        If we want to retransform a SystemDs Matrix to a Numpy array we can do so by using the np.array() function.
+        First we re-split out data into a training and a test set with the corresponding labels.
         """""
         ################################################################################################################
-        col_length = len(X1[0])
-        X = X1[0:train_count, 0:col_length - 1]
-        Y = X1[0:train_count, col_length - 1:col_length].flatten()
-        # Test data
-        Xt = X1[train_count:train_count + test_count, 0:col_length - 1]
-        Yt = X1[train_count:train_count + test_count, col_length - 1:col_length].flatten()
+        PREPROCESS_package = self.sds.source(self.preprocess_src_path, "preprocess", print_imported_methods=True)
 
+        X = PREPROCESS_package.get_X(X1, train_count)
+        Y = PREPROCESS_package.get_Y(X1, train_count)
+        #We lose the column count information after using the Preprocess Package. This triggers an error on multilogregpredict. Otherwise its working
+        Xt = self.sds.from_numpy(np.array(PREPROCESS_package.get_Xt(X1, train_count).compute()))

Review comment:
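       Splitting the matrix inside a sourced DML file loses the column count
information, which then triggers an error in multilogregpredict. Is there a
way to avoid this without the workaround above, which computes the result,
converts it to a NumPy array, and re-imports it with from_numpy()?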
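
For reference, the NumPy-side split that the removed lines performed can be sketched roughly as below. This is only an illustration, not the tutorial's code: the helper name `split_features_labels` and the sample data are hypothetical, and it assumes the label sits in the last column, as the removed slicing suggests.

```python
import numpy as np


def split_features_labels(data, train_count, test_count):
    """Split a 2-D array whose last column is the label (an assumption)
    into training and test features and labels."""
    col_length = data.shape[1]
    # Training rows: every column but the last is a feature.
    X = data[0:train_count, 0:col_length - 1]
    Y = data[0:train_count, col_length - 1]
    # Test rows follow directly after the training rows.
    Xt = data[train_count:train_count + test_count, 0:col_length - 1]
    Yt = data[train_count:train_count + test_count, col_length - 1]
    return X, Y, Xt, Yt


# Hypothetical 5x4 matrix: three feature columns plus one label column.
data = np.arange(20, dtype=float).reshape(5, 4)
X, Y, Xt, Yt = split_features_labels(data, train_count=3, test_count=2)
print(X.shape, Y.shape, Xt.shape, Yt.shape)
```

Because the slices are taken on a NumPy array, every result keeps a known shape, which is exactly the information reported as missing after the split in the sourced DML file.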




