Hi Fernando,

If you set maxSimilaritiesPerRow to 100 it will return only the 100 most similar rows for each row.
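
To make "the 100 most similar rows" concrete, here is a minimal single-machine sketch in Python. It is not the RowSimilarityJob code (that runs as a series of map/reduce jobs). The names `cosine` and `top_similar_rows` are made up for illustration, and the sketch assumes that a row is never reported as similar to itself.

```python
import heapq
import math


def cosine(a, b):
    """Cosine similarity of two equal-length dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero rows
    return dot / (norm_a * norm_b)


def top_similar_rows(matrix, max_per_row):
    """For each row index, keep only the max_per_row highest-scoring other rows.

    Illustrative brute force: every row is compared with every other row.
    Assumption: a row is excluded from its own result list.
    """
    result = {}
    for i, row in enumerate(matrix):
        scored = ((cosine(row, other), j)
                  for j, other in enumerate(matrix) if j != i)
        # nlargest keeps the highest (similarity, row index) pairs
        result[i] = heapq.nlargest(max_per_row, scored)
    return result


if __name__ == "__main__":
    tiny = [[1.0, 0.0], [1.0, 0.1], [0.0, 1.0]]
    for row_index, neighbours in top_similar_rows(tiny, 1).items():
        print(row_index, neighbours)
```

With `max_per_row` set to 100, each row would keep its 100 highest cosine scores and drop the rest.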

The density of your matrix could explain the long execution time: the number of comparisons becomes quadratic when every row needs to be compared with every other row (50K times 50K is 2.5 billion). RowSimilarityJob is meant to work on sparse matrices.
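
A rough sketch of that arithmetic in Python. The helper names are hypothetical, and the second function rests on a simplifying assumption: two rows only need to be compared if they share at least one non-zero column. That is the usual reason sparsity helps, but it is not taken from the RowSimilarityJob source.

```python
from itertools import combinations


def unique_pairs(num_rows):
    """Number of unordered row pairs (i, j) with i != j."""
    return num_rows * (num_rows - 1) // 2


def pairs_sharing_a_column(rows):
    """Count row pairs that have at least one non-zero column in common.

    rows: list of sets, each holding the non-zero column indices of one row.
    Assumption (illustrative only): pairs with no shared column can be skipped.
    """
    return sum(1 for a, b in combinations(rows, 2) if a & b)


if __name__ == "__main__":
    print(50_000 * 50_000)       # ordered pairs: 2500000000
    print(unique_pairs(50_000))  # unordered pairs: 1249975000

    dense = [{0, 1, 2}] * 4            # every row is non-zero everywhere
    sparse = [{0}, {1}, {2}, {0}]      # rows rarely overlap
    print(pairs_sharing_a_column(dense))   # 6, i.e. all pairs
    print(pairs_sharing_a_column(sparse))  # 1
```

In a dense matrix every pair of rows overlaps, so nothing can be skipped and the full quadratic cost applies.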

Could you give us some details about your use case?

--sebastian




On 20.12.2010 12:58, Fernando Fernández wrote:
Ok, understood now :)

About the parameters:

It's a 50000x100 dense matrix, so I set the --numberOfColumns parameter to
100, and the rest have the default values. (This means that
maxSimilaritiesPerRow is set to 100, but I don't know which 100 it will
return...)

2010/12/20 Sebastian Schelter<[email protected]>

Hi,

Most of Mahout's algorithm implementations need to run a series of
map/reduce jobs to compute their results. By specifying a start and end phase
you can make the implementation run only some of these internal jobs. You
could, e.g., use this to restart a failed execution.

--sebastian



On 20.12.2010 12:41, Fernando Fernández wrote:

But does this affect the result? What will I get if I launch RowSimilarity
(cosine similarity) with --startPhase=1 and --endPhase=2? I don't fully
understand what "phases" exactly are in this case.

2010/12/20 Niall Riddell<[email protected]>

Startphase and endphase shouldn't impact overall performance in any way;
however, they do mean that you can start at a later stage in a job pipeline.

You can execute specific MR jobs by designating a startphase and endphase.
It goes without saying that the correct inputs must be available to start a
phase correctly.

The first MR job is index 0. So setting --startPhase 1 will execute the 2nd
job onwards. Putting in --endPhase 2 would stop after the 3rd job.
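
The zero-based indexing can be sketched in a few lines of Python. This is only an illustration of the semantics described above, not Mahout's implementation. `run_pipeline` and the four placeholder job names are hypothetical, and treating a missing end phase as "run to the last job" is an assumption.

```python
def run_pipeline(jobs, start_phase=0, end_phase=None):
    """Run only the jobs whose zero-based index lies in [start_phase, end_phase].

    jobs: list of (name, callable) pairs, executed in order.
    Assumption: end_phase=None means "run through the last job".
    Returns the names of the jobs that were actually executed.
    """
    if end_phase is None:
        end_phase = len(jobs) - 1
    executed = []
    for phase, (name, job) in enumerate(jobs):
        if phase < start_phase or phase > end_phase:
            continue  # skipped; its outputs must already exist from an earlier run
        job()
        executed.append(name)
    return executed


if __name__ == "__main__":
    # Placeholder jobs; a real pipeline would launch map/reduce jobs here.
    jobs = [("job0", lambda: None), ("job1", lambda: None),
            ("job2", lambda: None), ("job3", lambda: None)]
    print(run_pipeline(jobs))                              # all four jobs
    print(run_pipeline(jobs, start_phase=1, end_phase=2))  # 2nd and 3rd job only
```

Skipping phase 0 only works if the files that phase would have written are still in place, which is why this is mainly useful for resuming a run that failed part-way.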

On 20 Dec 2010 11:17, "Fernando Fernández" <[email protected]> wrote:

Hello everyone,

Can anyone explain what exactly these two parameters (startphase and
endphase) are and how to use them? I'm trying to launch a RowSimilarity job
on a 50K row matrix (100 columns) with cosine similarity and default
startphase and endphase parameters, and I'm getting extremely poor
performance on a quite big cluster (after 16 hours, it has only reached 3%
of the process). I think that this could have something to do with the
startphase and endphase parameters. What do you think? How do these
parameters affect the RowSimilarity job?

Thanks in advance.
Fernando.

