Hi Reza,

With 40 nodes and shuffle space managed by YARN over the HDFS usercache, we
could run the similarity job without any thresholding. We used the hash-based
shuffle; the sort-based shuffle will hopefully improve it further. Note that
this job was almost 6M x 1.5M.

We will go towards 50M rows x ~3M columns and increase the sparsity
pattern. The DIMSUM configurations will definitely help there.

With a baseline run, it will now be easier for me to run DIMSUM sampling
and compare the results. I will try the configurations you pointed to.
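
For reference, a minimal sketch of the two call patterns I am comparing,
against the MLlib RowMatrix API; the toy input and the 0.1 threshold below
are placeholders, not our real data or settings:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.linalg.distributed.RowMatrix

val sc = new SparkContext(new SparkConf().setAppName("colsim"))

// toy stand-in for the real 6M x 1.5M sparse input
val rows = sc.parallelize(Seq(
  Vectors.sparse(5, Seq((0, 1.0), (3, 2.0))),
  Vectors.sparse(5, Seq((1, 1.0), (3, 1.0))),
  Vectors.sparse(5, Seq((0, 3.0), (4, 1.0)))))
val mat = new RowMatrix(rows)

// baseline: exact all-pairs cosine similarities between columns
val exact = mat.columnSimilarities()

// DIMSUM sampling: pairs below the threshold may be missed, but the
// shuffle volume drops as the threshold grows
val approx = mat.columnSimilarities(0.1)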

Thanks.
Deb

On Wed, Feb 25, 2015 at 3:52 PM, Reza Zadeh <r...@databricks.com> wrote:

> Hi Deb,
>
> Did you try using higher threshold values, as I mentioned in an earlier
> email? Use RowMatrix.columnSimilarities(x), where x is some number. Try
> the following values for x:
> 0.1, 0.9, 10, 100
>
> And yes, the idea is that the matrix is tall and skinny; you are pushing the
> boundary with 1.5m columns, because the output can potentially have 2.25 x
> 10^12 (1.5m squared) entries, which is a lot.
>
> Best,
> Reza
>
>
> On Wed, Feb 25, 2015 at 10:13 AM, Debasish Das <debasish.da...@gmail.com>
> wrote:
>
>> Is the threshold valid only for tall, skinny matrices? Mine is 6M x 1.5M,
>> and I made the sparsity pattern 100:1.5M. We would like to increase the
>> sparsity pattern to 1000:1.5M.
>>
>> I am running 1.1 stable and I get random shuffle failures. Maybe the 1.2
>> sort-based shuffle will help.
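>>
>> (A minimal sketch of what I mean to try; these settings are guesses on my
>> side, not something we have validated:)
>>
>> import org.apache.spark.SparkConf
>>
>> val conf = new SparkConf()
>>   .setAppName("similarity")
>>   // Spark 1.2: sort-based shuffle (the default there) keeps the number
>>   // of shuffle files per node low
>>   .set("spark.shuffle.manager", "sort")
>>   // On 1.1 with the hash shuffle, consolidating map output files may
>>   // help instead:
>>   // .set("spark.shuffle.consolidateFiles", "true")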
>>
>> I read in Reza's paper that oversampling works only if the columns are
>> skinny, so I am not very keen to oversample.
>>  On Feb 17, 2015 2:01 PM, "Xiangrui Meng" <men...@gmail.com> wrote:
>>
>>> The complexity of DIMSUM is independent of the number of rows but
>>> still has a quadratic dependency on the number of columns. 1.5M columns
>>> may be too large for DIMSUM. Try increasing the threshold and see
>>> whether it helps. -Xiangrui
>>>
>>> On Tue, Feb 17, 2015 at 6:28 AM, Debasish Das <debasish.da...@gmail.com>
>>> wrote:
>>> > Hi,
>>> >
>>> > I am running brute-force similarity from RowMatrix on a job with a 5M x
>>> > 1.5M sparse matrix with 800M entries. With 200M entries the job runs
>>> > fine, but with 800M I get exceptions like "too many open files" and "no
>>> > space left on device".
>>> >
>>> > It seems like I need more nodes, or should I use DIMSUM sampling?
>>> >
>>> > I am running on 10 nodes where the ulimit on each node is set to 65K.
>>> > Memory is not an issue, since I can cache the dataset before the
>>> > similarity computation starts.
>>> >
>>> > I tested the same job on YARN with Spark 1.1 and Spark 1.2 stable. Both
>>> > jobs failed with FetchFailed messages.
>>> >
>>> > Thanks.
>>> > Deb
>>>
>>
>
