[ 
https://issues.apache.org/jira/browse/SYSTEMML-1043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15680623#comment-15680623
 ] 

Niketan Pansare edited comment on SYSTEMML-1043 at 11/20/16 7:49 AM:
---------------------------------------------------------------------

Hi [~iyounus]

I tried the above script with randomly generated sparse data and found that 
most of the time is spent in py4j data conversion. 

{code:python}
from pyspark import SparkContext
import systemml as sml
from systemml import random
import scipy
sc = SparkContext()
sml.setSparkContext(sc)
tfidf = scipy.sparse.rand(114720, 11590, density=0.01)
V = sml.matrix(tfidf)
#V = sml.load('tmp.mm.mtx', format='mm')
k = 40
m, n = V.shape
W = sml.random.uniform(size=(m, k))
H = sml.random.uniform(size=(k, n))
max_iters = 200
for i in range(max_iters):
    H = H * (W.transpose().dot(V))/(W.transpose().dot(W.dot(H)))
    W = W * (V.dot(H.transpose()))/(W.dot(H.dot(H.transpose())))

sml.eval([H, W])
H = H.toNumPyArray()
W = W.toNumPyArray()
{code}

With density 0.01, it took 3 minutes.

But with density 0.1, creating the random sparse matrix and transferring it via 
py4j took a long time and eventually failed before executing any code.
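A rough back-of-the-envelope estimate (my own calculation, not a measurement from the script) shows why the transfer struggles at density 0.1: the matrix has about 133 million nonzeros, so even a compact COO representation is on the order of 2 GB crossing the py4j bridge.

```python
# Back-of-envelope estimate (assumption, not measured): in-memory size of
# the sparse matrix that must cross the py4j bridge at each density.
rows, cols = 114720, 11590

for density in (0.01, 0.1):
    nnz = int(rows * cols * density)
    # COO layout: one float64 value plus two int32 indices per nonzero.
    bytes_est = nnz * (8 + 4 + 4)
    print("density=%.2f  nnz=%d  ~%.2f GB" % (density, nnz, bytes_est / 1e9))
```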

To overcome this issue, I have added a load function that lets you create a 
matrix from a file. Please update the above code as follows:

{code:python}
...
#tfidf = scipy.sparse.rand(114720, 11590, density=0.01)
#V = sml.matrix(tfidf)
V = sml.load('tmp.mm.mtx', format='mm')
...
{code}


To generate the data and store it in a file:

{code:python}
import scipy
import scipy.io
tfidf = scipy.sparse.rand(114720, 11590, density=0.1)
scipy.io.mmwrite('tmp.mm', tfidf)
{code}
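To sanity-check the Matrix Market round trip, you can read the file back with {{scipy.io.mmread}} and compare. This is a small self-contained sketch using a tiny stand-in matrix (and a hypothetical filename {{tmp_small.mm}}) rather than the full 114720 x 11590 one:

```python
import numpy as np
import scipy.sparse
import scipy.io

# Small stand-in for the real tfidf matrix, so the check runs in seconds.
orig = scipy.sparse.rand(100, 50, density=0.1, random_state=0)

# mmwrite appends the .mtx extension when the target name lacks one.
scipy.io.mmwrite('tmp_small.mm', orig)
back = scipy.io.mmread('tmp_small.mm.mtx')

# Shape, sparsity pattern, and values should all survive the round trip.
assert back.shape == orig.shape
assert back.nnz == orig.nnz
assert np.allclose(back.toarray(), orig.toarray())
print("round trip OK:", back.shape, back.nnz, "nonzeros")
```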

This took 8 minutes :)
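For reference, the same multiplicative updates (Lee-Seung style, as in the loop above) can be sanity-checked in plain NumPy on a tiny dense matrix. This is my own sketch, not part of the SystemML API; the small epsilon in the denominators is an addition to avoid division by zero:

```python
import numpy as np

rng = np.random.RandomState(0)
V = rng.rand(30, 20)          # tiny dense stand-in for the tfidf matrix
k, eps = 5, 1e-9

W = rng.rand(30, k)
H = rng.rand(k, 20)

def frob_err(V, W, H):
    # Frobenius-norm reconstruction error ||V - WH||_F
    return np.linalg.norm(V - W.dot(H))

err_before = frob_err(V, W, H)
for _ in range(200):
    # Same element-wise multiplicative updates as the SystemML script.
    H *= W.T.dot(V) / (W.T.dot(W.dot(H)) + eps)
    W *= V.dot(H.T) / (W.dot(H.dot(H.T)) + eps)
err_after = frob_err(V, W, H)

print("error: %.4f -> %.4f" % (err_before, err_after))
assert err_after < err_before  # the updates do not increase the error
```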



> NMF implementation taking too long
> ----------------------------------
>
>                 Key: SYSTEMML-1043
>                 URL: https://issues.apache.org/jira/browse/SYSTEMML-1043
>             Project: SystemML
>          Issue Type: Bug
>          Components: APIs, PyDML
>         Environment: standalone mode on laptop, and yarn cluster with 10 nodes
>            Reporter: Imran Younus
>
> I'm testing the following NMF algorithm written using python API:
> {code}
> from pyspark.sql import SQLContext
> import systemml as sml
> from systemml import random
> sqlContext = SQLContext(sc)
> sml.setSparkContext(sc)
> m, n = tfidf.shape
> k = 40
> V = sml.matrix(tfidf)
> W = sml.random.uniform(size=(m, k))
> H = sml.random.uniform(size=(k, n))
> max_iters = 200
> for i in range(max_iters):
>     H = H * (W.transpose().dot(V))/(W.transpose().dot(W.dot(H)))
>     W = W * (V.dot(H.transpose()))/(W.dot(H.dot(H.transpose())))
> W = W.toNumPyArray()
> {code}
> Here {{tfidf}} is a sparse matrix of shape (114720, 11590).
> The evaluation of {{W}} takes more than one hour when running on laptop. On 
> yarn cluster, it didn't finish in 1.5 hours (I killed the job).
> If I evaluate {{H}} matrix instead, it just takes 2 min.
> Note that even if I call {{eval}} before evaluating {{W}}, it doesn't make 
> any difference. {{W}} still takes an hour.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
