Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
Is the threshold valid only for tall, skinny matrices? Mine is 6M x 1.5M, and I
made the sparsity pattern 100:1.5M. We would like to increase the sparsity
pattern to 1000:1.5M.

I am running 1.1 stable and I get random shuffle failures... maybe the 1.2
sort-based shuffle will help.

I read in Reza's paper that oversampling works only if the columns are skinny,
so I am not very keen to oversample...
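
For context, here is a minimal Scala sketch of the two calls at issue in this
thread: the brute-force RowMatrix.columnSimilarities() versus the DIMSUM-sampled
columnSimilarities(threshold). The SparkContext `sc`, the toy dimensions, and
the chosen threshold are illustrative assumptions, not the 6M x 1.5M job
discussed here.

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

// Toy 4 x 5 sparse matrix; `sc` is assumed to be an existing SparkContext
// (e.g. from spark-shell).
val rows = sc.parallelize(Seq(
  Vectors.sparse(5, Seq((0, 1.0), (3, 2.0))),
  Vectors.sparse(5, Seq((1, 1.0), (4, 1.0))),
  Vectors.sparse(5, Seq((0, 3.0), (1, 1.0))),
  Vectors.sparse(5, Seq((2, 1.0), (4, 2.0)))
))
rows.cache()  // cache the rows before the similarity computation starts
val mat = new RowMatrix(rows)

// Brute force: exact cosine similarity for every column pair.
val exact = mat.columnSimilarities()

// DIMSUM sampling: pairs below the threshold may be dropped, but the
// shuffle volume shrinks as the threshold grows.
val approx = mat.columnSimilarities(0.1)
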
 On Feb 17, 2015 2:01 PM, Xiangrui Meng men...@gmail.com wrote:

 The complexity of DIMSUM is independent of the number of rows, but it still
 has a quadratic dependency on the number of columns. 1.5M columns may be too
 large for DIMSUM. Try increasing the threshold and see whether it helps.
 -Xiangrui

 On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  I am running brute-force similarity from RowMatrix on a job with a 5M x 1.5M
  sparse matrix with 800M entries. With 200M entries the job runs fine, but with
  800M I am getting exceptions like "too many files open" and "no space left on
  device"...
 
  Seems like I need more nodes, or should I use DIMSUM sampling?
 
  I am running on 10 nodes where the ulimit on each node is set at 65K...
  Memory is not an issue, since I can cache the dataset before the similarity
  computation starts.
 
  I tested the same job on YARN with Spark 1.1 and Spark 1.2 stable. Both jobs
  failed with FetchFailed messages.
 
  Thanks.
  Deb



Re: Large Similarity Job failing

2015-02-25 Thread Debasish Das
Hi Reza,

With 40 nodes and the shuffle space managed by YARN over the HDFS usercache, we
could run the similarity job without doing any thresholding... We used
hash-based shuffle, and sort-based shuffle will hopefully improve it further...
Note that this job was almost 6M x 1.5M.
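
For reference, a minimal Spark 1.x sketch of the shuffle settings in question
(hash-based versus sort-based), set on the SparkConf before the job starts; the
application name and the chosen values are illustrative assumptions:

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("ColumnSimilarities")  // name is a placeholder
  // "hash" was the default through Spark 1.1; "sort" (the 1.2 default)
  // writes far fewer shuffle files per node.
  .set("spark.shuffle.manager", "sort")
  // Relevant to the hash-based manager: consolidating map output files
  // helps against "too many files open" under a low ulimit.
  .set("spark.shuffle.consolidateFiles", "true")
val sc = new SparkContext(conf)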

We will go towards 50M x ~3M columns and increase the sparsity pattern... The
DIMSUM configurations will definitely help there...

With a baseline run, it will now be easier for me to run DIMSUM sampling and
compare the results... I will try the configs that you pointed to.

Thanks.
Deb
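
A small sketch of the threshold sweep Reza suggests in the mail quoted below,
assuming `mat` is a RowMatrix built as in the earlier sketch and that a count of
the output per run is affordable:

// Sweep the suggested thresholds and record how many similarity entries
// survive each setting.
val thresholds = Seq(0.1, 0.9, 10.0, 100.0)
thresholds.foreach { t =>
  val sims = mat.columnSimilarities(t)
  println(s"threshold=$t -> ${sims.entries.count()} similarity entries kept")
}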

On Wed, Feb 25, 2015 at 3:52 PM, Reza Zadeh r...@databricks.com wrote:

 Hi Deb,

 Did you try using higher threshold values, as I mentioned in an earlier email?
 Use RowMatrix.columnSimilarities(x), where x is some number. Try the following
 values for x:
 0.1, 0.9, 10, 100

 And yes, the idea is that the matrix is skinny; you are pushing the boundary
 with 1.5M columns, because the output can potentially have 2.25 x 10^12 entries
 (1.5M squared), which is a lot.

 Best,
 Reza


 On Wed, Feb 25, 2015 at 10:13 AM, Debasish Das debasish.da...@gmail.com
 wrote:

 Is the threshold valid only for tall, skinny matrices? Mine is 6M x 1.5M, and
 I made the sparsity pattern 100:1.5M. We would like to increase the sparsity
 pattern to 1000:1.5M.

 I am running 1.1 stable and I get random shuffle failures... maybe the 1.2
 sort-based shuffle will help.

 I read in Reza's paper that oversampling works only if the columns are skinny,
 so I am not very keen to oversample...
  On Feb 17, 2015 2:01 PM, Xiangrui Meng men...@gmail.com wrote:

 The complexity of DIMSUM is independent of the number of rows, but it still
 has a quadratic dependency on the number of columns. 1.5M columns may be too
 large for DIMSUM. Try increasing the threshold and see whether it helps.
 -Xiangrui

 On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  I am running brute-force similarity from RowMatrix on a job with a 5M x 1.5M
  sparse matrix with 800M entries. With 200M entries the job runs fine, but with
  800M I am getting exceptions like "too many files open" and "no space left on
  device"...
 
  Seems like I need more nodes, or should I use DIMSUM sampling?
 
  I am running on 10 nodes where the ulimit on each node is set at 65K...
  Memory is not an issue, since I can cache the dataset before the similarity
  computation starts.
 
  I tested the same job on YARN with Spark 1.1 and Spark 1.2 stable. Both jobs
  failed with FetchFailed messages.
 
  Thanks.
  Deb





Re: Large Similarity Job failing

2015-02-17 Thread Xiangrui Meng
The complexity of DIMSUM is independent of the number of rows, but it still has
a quadratic dependency on the number of columns. 1.5M columns may be too large
for DIMSUM. Try increasing the threshold and see whether it helps. -Xiangrui

On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das debasish.da...@gmail.com wrote:
 Hi,

 I am running brute-force similarity from RowMatrix on a job with a 5M x 1.5M
 sparse matrix with 800M entries. With 200M entries the job runs fine, but with
 800M I am getting exceptions like "too many files open" and "no space left on
 device"...

 Seems like I need more nodes, or should I use DIMSUM sampling?

 I am running on 10 nodes where the ulimit on each node is set at 65K... Memory
 is not an issue, since I can cache the dataset before the similarity
 computation starts.

 I tested the same job on YARN with Spark 1.1 and Spark 1.2 stable. Both jobs
 failed with FetchFailed messages.

 Thanks.
 Deb
