[ https://issues.apache.org/jira/browse/SYSTEMML-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680623#comment-15680623 ]
Niketan Pansare commented on SYSTEMML-1043:
-------------------------------------------

Hi [~iyounus] I tried the above script with randomly generated sparse data and found that most of the time is spent in py4j data conversion.

{code:python}
from pyspark import SparkContext
import systemml as sml
from systemml import random
import scipy

sc = SparkContext()
sml.setSparkContext(sc)

tfidf = scipy.sparse.rand(114720, 11590, density=0.01)
V = sml.matrix(tfidf)
#V = sml.load('tmp.mm.mtx', format='mm')
k = 40
m, n = V.shape
W = sml.random.uniform(size=(m, k))
H = sml.random.uniform(size=(k, n))
max_iters = 200
for i in range(max_iters):
    H = H * (W.transpose().dot(V))/(W.transpose().dot(W.dot(H)))
    W = W * (V.dot(H.transpose()))/(W.dot(H.dot(H.transpose())))
sml.eval([H, W])
H = H.toNumPyArray()
W = W.toNumPyArray()
{code}

With density 0.01, this took 3 minutes. But with density 0.1, creating a random sparse matrix and transferring it via py4j took a long time and eventually failed before executing any code. To overcome this issue, I have added a load function that allows you to create a matrix from a file (please uncomment the corresponding line in the above code).
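For context, the update rules in the script are the standard Lee-Seung multiplicative NMF updates, which can be sketched in plain NumPy/SciPy (made-up small dimensions so it runs in seconds; this is an illustration of the algorithm, not the SystemML code path — the `eps` guard is an addition to avoid division by zero):

{code:python}
import numpy as np
import scipy.sparse

# Small hypothetical dimensions for illustration;
# the actual script uses (114720, 11590) with k = 40.
m, n, k = 200, 100, 10
rng = np.random.RandomState(0)

V = scipy.sparse.rand(m, n, density=0.1, random_state=0).toarray()
W = rng.uniform(size=(m, k))
H = rng.uniform(size=(k, n))

eps = 1e-9  # guard against division by zero (not in the original script)
for _ in range(100):
    # Lee-Seung multiplicative updates, mirroring the SystemML script above
    H *= W.T.dot(V) / (W.T.dot(W.dot(H)) + eps)
    W *= V.dot(H.T) / (W.dot(H.dot(H.T)) + eps)

# Frobenius-norm reconstruction error of the factorization
err = np.linalg.norm(V - W.dot(H))
{code}

Because every iteration multiplies by nonnegative ratios, W and H stay elementwise nonnegative throughout, which is the defining property of NMF.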
To generate the data and store it in a file:

{code:python}
import scipy
import scipy.io

tfidf = scipy.sparse.rand(114720, 11590, density=0.1)
scipy.io.mmwrite('tmp.mm', tfidf)
{code}

This took 8 minutes :)

> NMF implementation taking too long
> ----------------------------------
>
> Key: SYSTEMML-1043
> URL: https://issues.apache.org/jira/browse/SYSTEMML-1043
> Project: SystemML
> Issue Type: Bug
> Components: APIs, PyDML
> Environment: standalone mode on laptop, and yarn cluster with 10 nodes
> Reporter: Imran Younus
>
> I'm testing the following NMF algorithm written using the python API:
> {code}
> from pyspark.sql import SQLContext
> import systemml as sml
> from systemml import random
>
> sqlContext = SQLContext(sc)
> sml.setSparkContext(sc)
>
> m, n = tfidf.shape
> k = 40
> V = sml.matrix(tfidf)
> W = sml.random.uniform(size=(m, k))
> H = sml.random.uniform(size=(k, n))
> max_iters = 200
> for i in range(max_iters):
>     H = H * (W.transpose().dot(V))/(W.transpose().dot(W.dot(H)))
>     W = W * (V.dot(H.transpose()))/(W.dot(H.dot(H.transpose())))
> W = W.toNumPyArray()
> {code}
> Here {{tfidf}} is a sparse matrix of shape (114720, 11590).
> The evaluation of {{W}} takes more than one hour when running on a laptop. On the yarn cluster, it didn't finish in 1.5 hours (I killed the job).
> If I evaluate the {{H}} matrix instead, it takes just 2 minutes.
> Note that even if I call {{eval}} before evaluating {{W}}, it doesn't make any difference. {{W}} still takes an hour.

-- This message was sent by Atlassian JIRA (v6.3.4#6332)
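As a sanity check on the Matrix Market workflow above, the write/read round trip can be sketched with SciPy alone (hypothetical, much smaller dimensions and a temporary file path, purely for illustration — note that {{mmwrite}} appends the {{.mtx}} extension, which is why the load line in the comment refers to {{tmp.mm.mtx}}):

{code:python}
import os
import tempfile

import scipy.io
import scipy.sparse

# Much smaller matrix than the one above, just to illustrate the round trip
tfidf = scipy.sparse.rand(100, 50, density=0.1, random_state=42)

path = os.path.join(tempfile.mkdtemp(), 'tmp.mm')
scipy.io.mmwrite(path, tfidf)           # writes path + '.mtx'
loaded = scipy.io.mmread(path + '.mtx')

# The round trip preserves shape and values (up to float formatting)
max_diff = abs((loaded - tfidf).toarray()).max()
{code}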