Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
Is the threshold valid only for tall-skinny matrices? Mine is 6M x 1.5M, and I made the sparsity pattern 100:1.5M. We would like to increase the sparsity pattern to 1000:1.5M. I am running 1.1 stable and I get random shuffle failures; maybe the 1.2 sort shuffle will help. I read in Reza's paper that

Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
Hi Reza, With 40 nodes and shuffle space managed by YARN over HDFS usercache, we could run the similarity job without doing any thresholding. We used hash-based shuffle, and sort-based shuffle will hopefully improve it further. Note that this job was almost 6M x 1.5M. We will go towards 50M x ~3M columns and

Large Similarity Job failing

2015-02-17 Thread Debasish Das
Hi, I am running brute-force similarity from RowMatrix on a job with a 5M x 1.5M sparse matrix with 800M entries. With 200M entries the job runs fine, but with 800M I am getting exceptions like too many files open and no space left on device. It seems I need more nodes, or should I use DIMSUM sampling?
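
A rough back-of-envelope sketch of why the 800M-entry run might stress the shuffle so much more than the 200M-entry run. It assumes that the brute-force path emits one product per pair of non-zeros within a row, that non-zeros are spread evenly over rows, and that the 200M-entry run used the same 5M rows; none of these assumptions is stated in the thread, and `pair_emissions` is a hypothetical helper, not a Spark API.

```python
def pair_emissions(num_rows, num_entries):
    """Estimate the column-pair products emitted by a brute-force pass.

    Assumption: non-zeros are spread evenly over the rows.
    """
    nnz_per_row = num_entries / num_rows
    # A row with k non-zeros contributes k * (k - 1) / 2 distinct column pairs.
    return num_rows * nnz_per_row * (nnz_per_row - 1) / 2


# Figures from the message: 5M rows, 200M vs. 800M non-zero entries.
small = pair_emissions(5_000_000, 200_000_000)
large = pair_emissions(5_000_000, 800_000_000)

print(f"200M entries: {small:.3e} pair products")
print(f"800M entries: {large:.3e} pair products")
print(f"growth factor: {large / small:.1f}x")
```

Under these assumptions, four times as many entries gives roughly sixteen times as many emitted pairs, which would be consistent with the job running out of file handles and local disk during the shuffle.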

Re: Large Similarity Job failing

2015-02-17 Thread Xiangrui Meng
The complexity of DIMSUM is independent of the number of rows but still has a quadratic dependency on the number of columns. 1.5M columns may be too large to use DIMSUM. Try to increase the threshold and see whether it helps. -Xiangrui On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das
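
To illustrate the two points in this reply, here is a minimal sketch in plain Python (not Spark code). It counts the candidate column pairs for 1.5M columns, then shows how a larger threshold lowers the per-column sampling probability. The formulas `gamma = 10 * log(n) / threshold` and `p = min(1, sqrt(gamma) / norm)` reflect my understanding of how MLlib's `columnSimilarities(threshold)` oversamples and should be treated as an assumption; the function names and the column norm of 100 are hypothetical.

```python
import math


def dimsum_gamma(num_cols, threshold):
    """Oversampling parameter; a near-zero threshold disables sampling (assumed formula)."""
    if threshold < 1e-6:
        return math.inf
    return 10 * math.log(num_cols) / threshold


def keep_probability(col_norm, gamma):
    """Probability that an entry of a column with the given L2 norm is kept."""
    return min(1.0, math.sqrt(gamma) / col_norm)


num_cols = 1_500_000
# Number of distinct column pairs: the quadratic dependency on columns.
all_pairs = num_cols * (num_cols - 1) // 2
print(f"candidate column pairs: {all_pairs:.3e}")

# Hypothetical column norm of 100, chosen only for illustration.
for threshold in (0.0, 0.1, 0.5):
    gamma = dimsum_gamma(num_cols, threshold)
    p = keep_probability(col_norm=100.0, gamma=gamma)
    print(f"threshold={threshold}: keep probability {p:.3f}")
```

A threshold of 0 keeps every entry, which corresponds to the brute-force case. Raising the threshold shrinks gamma, so heavy columns are sampled more aggressively and fewer pairs reach the shuffle.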