Hi,
I'm creating a process in SystemML and running it through Spark. I'm running
the code in the following way:
# Spark Specifications:
import os
import sys
import pandas as pd
import numpy as np
spark_path = "C:\spark"
os.environ['SPARK_HOME'] = spark_path
os.environ['HADOOP_HOME'] = spark_path
sys.path.append(spark_path + "/bin")
sys.path.append(spark_path + "/python")
sys.path.append(spark_path + "/python/pyspark/")
sys.path.append(spark_path + "/python/lib")
sys.path.append(spark_path + "/python/lib/pyspark.zip")
sys.path.append(spark_path + "/python/lib/py4j-0.10.4-src.zip")
from pyspark import SparkContext
from pyspark import SparkConf
sc = SparkContext("local[*]", "test")
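# Aside: I'm using the default configuration above. If tuning the SparkConf
# would help in local mode, I'd be glad to know; I imagine something like the
# following, though the memory value is just a placeholder I haven't verified:
# conf = SparkConf().setMaster("local[*]").setAppName("test") \
#     .set("spark.driver.memory", "4g")
# sc = SparkContext(conf=conf)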
# SystemML Specifications:
from pyspark.sql import SQLContext
import systemml as sml
sqlCtx = SQLContext(sc)
ml = sml.MLContext(sc)
# Importing the data
train_data = pd.read_csv("data1.csv")
test_data = pd.read_csv("data2.csv")
# read_csv already returns pandas DataFrames, so no extra pd.DataFrame() wrap is needed
train_data = sqlCtx.createDataFrame(train_data)
test_data = sqlCtx.createDataFrame(test_data)
# Finally executing the code:
scriptUrl = "C:/systemml-0.13.0-incubating-bin/scripts/model_code.dml"
script = sml.dml(scriptUrl).input(bdframe_train=train_data,
                                  bdframe_test=test_data).output("check_func")
beta = ml.execute(script).get("check_func").toNumPy()
pd.DataFrame(beta).head(1)
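For what it's worth, here is how I would time just the execute() call, to
separate the DML run from the one-time Spark/JVM startup (a minimal sketch;
re-running it in the same session should show whether the 20-30 seconds is
startup cost or per-run cost):
import time
start = time.time()
beta = ml.execute(script).get("check_func").toNumPy()
print("execute() took %.1f seconds" % (time.time() - start))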
The data sizes are 1,000 and 100 rows for the train and test sets
respectively. I'm testing on a small dataset during development and will test
on a larger dataset later. I'm running on my local system with 4 cores.
The problem is that when I run the model in R, it takes a fraction of a
second, but when I run it like this, it takes around 20-30 seconds.
Could anyone please suggest how to improve the execution speed, or point out
any other way of executing the code that would be faster?
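For example, one thing I wondered about (but haven't verified against the API
docs) is whether the pandas frames could be passed to input() directly,
skipping the Spark DataFrame conversion entirely:
# assuming the MLContext converters accept pandas DataFrames directly,
# which I haven't confirmed:
train_pd = pd.read_csv("data1.csv")
test_pd = pd.read_csv("data2.csv")
script = sml.dml(scriptUrl).input(bdframe_train=train_pd,
                                  bdframe_test=test_pd).output("check_func")
beta = ml.execute(script).get("check_func").toNumPy()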
Also, thank you all for releasing the 0.14 version. There are a few
improvements we found extremely helpful.
Thank you!
Arijit