Hi,

Can you describe your input in more detail (how dense or sparse is the matrix?) and give us the exact parameters you started the job with? Something seems to be going very wrong with that execution time. Usually the running time of RowSimilarityJob is dominated by the longest row vector, but that shouldn't be a problem if you have only 100 columns.
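
If it helps with describing the input, a rough sketch like the one below reports the overall density and the length of the longest row. It is only an illustration: the function name describe_matrix and the dict-per-row representation (column index mapped to value) are assumptions made for the example, not anything taken from Mahout.

```python
def describe_matrix(rows, num_columns):
    """Summarise sparsity for a matrix given as a list of {column: value} dicts.

    Hypothetical helper for illustration only; not part of Mahout.
    """
    # Number of non-zero entries in each row.
    nnz_per_row = [sum(1 for v in row.values() if v != 0) for row in rows]
    total_nnz = sum(nnz_per_row)
    cells = len(rows) * num_columns
    density = total_nnz / cells if cells else 0.0
    return {
        "density": density,                         # fraction of non-zero cells
        "longest_row": max(nnz_per_row, default=0)  # most non-zeros in any row
    }


# Tiny made-up example: 3 rows, 4 columns.
rows = [{0: 1.0, 3: 2.0}, {1: 0.5}, {0: 4.0, 1: 1.0, 2: 3.0}]
stats = describe_matrix(rows, num_columns=4)
print(stats)
```

A density close to 1.0 means nearly every cell holds a value, and the longest_row figure is the number to compare against the column count.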

--sebastian

On 20.12.2010 12:27, Niall Riddell wrote:
Startphase and endphase shouldn't affect overall performance in any way;
they only mean that you can start at a later stage in a job pipeline.

You can execute specific MR jobs by designating a startphase and endphase.
It goes without saying that the correct inputs must be available to start a
phase correctly.

The first MR job is index 0. So setting --startPhase 1 will execute from the
2nd job onwards. Putting in --endPhase 2 would stop after the 3rd job.
On 20 Dec 2010 11:17, "Fernando Fernández"<
[email protected]>  wrote:
Hello everyone,

Can anyone explain what exactly these two parameters (startphase and
endphase) are and how to use them? I'm trying to launch a RowSimilarity job on
a 50K row matrix (100 columns) with cosine similarity and default startphase
and endphase parameters, and I'm getting extremely poor performance on a
quite big cluster (after 16 hours, it had only reached 3% of the process). I
think that this could have something to do with the startphase and endphase
parameters. What do you think? How do these parameters affect the
RowSimilarity job?

Thanks in advance.
Fernando.
