Hi,
I'm creating a process in SystemML and running it through Spark. I'm running the code the following way:

# Spark specifications:
import os
import sys
import pandas as pd
import numpy as np

spark_path = "C:/spark"
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")

from pyspark import SparkContext
from pyspark import SparkConf

sc = SparkContext("local[*]", "test")

# SystemML specifications:
from pyspark.sql import SQLContext
import systemml as sml

sqlCtx = SQLContext(sc)
ml = sml.MLContext(sc)

# Importing the data (read_csv already returns DataFrames,
# so no extra pd.DataFrame(...) wrapping is needed):
train_data = pd.read_csv("data1.csv")
test_data = pd.read_csv("data2.csv")
train_data = sqlCtx.createDataFrame(train_data)
test_data = sqlCtx.createDataFrame(test_data)

# Finally, executing the code:
scriptUrl = "C:/systemml-0.13.0-incubating-bin/scripts/model_code.dml"
script = sml.dml(scriptUrl).input(bdframe_train=train_data, bdframe_test=test_data).output("check_func")
beta = ml.execute(script).get("check_func").toNumPy()
pd.DataFrame(beta).head(1)

The data sizes are 1000 and 100 rows for train and test respectively. I'm testing on a small dataset during development and will test on a larger dataset later. I'm running on my local system with 4 cores.

The problem is: if I run the model in R, it takes a fraction of a second, but when I run it like this, it takes around 20-30 seconds. Could anyone suggest how to improve the execution speed? If there is any other way I can execute the code that would improve the execution speed, please let me know.

Also, thank you all for releasing the 0.14 version. There are a few improvements we found extremely helpful.

Thank you!
Arijit
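P.S. To check how much of the 20-30 seconds is one-time Spark/JVM startup and script compilation rather than the DML run itself, I plan to time only the execute call on a first and a second run. This is a minimal generic sketch; time_call is just a helper I wrote, not part of SystemML, and the ml/script names refer to the snippet above:

```python
import time

def time_call(fn):
    """Run fn() with no arguments and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start

# With the objects from the snippet above it would be used like this
# (hypothetical, assumes ml and script are already set up):
#   _, first_run = time_call(lambda: ml.execute(script).get("check_func").toNumPy())
#   _, second_run = time_call(lambda: ml.execute(script).get("check_func").toNumPy())
# A large gap between first_run and second_run would point at startup overhead.

# Generic demonstration of the helper on a plain Python call:
result, elapsed = time_call(lambda: sum(range(1000)))
```

If the second run is much faster than the first, the fixed startup cost dominates and the per-run cost on this small dataset may already be acceptable.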