I don't think that's what the job does. It is computing the similarity of every row of A with every other row of A, which is not what you are trying to do.
Even at tens of thousands of rows, that's not large at all and comfortably fits in memory. I would just continue with your non-distributed version. I think one fact that's overlooked is that distributing a computation typically introduces a load of overhead -- some constant scalar factor, and not a small one. It takes a lot of work to move all that data around. Distributing is a necessary evil, and I believe it should be avoided if you can avoid it.

2010/12/20 Fernando Fernández <[email protected]>

> Hi Sebastian,
>
> Actually, this is related to another message I sent a couple of days ago.
> What I really want to implement is an A-to-B similarity job. A is at the
> moment about 50K rows and B 1000 rows, but this will grow in the future
> (possibly A to hundreds of thousands and B to tens of thousands), so I
> thought a RowSimilarityJob over a C matrix (C being the rows of A and B
> put together) would give me an idea of the likely performance of this
> future distributed A-to-B similarity job, and some results to check
> whether the methodology works for my problem. I have a non-distributed
> version right now that solves the "50000 to 1000" problem in about 40
> minutes on a single machine, so I expect that a distributed version can
> solve it in approximately (time / # of nodes), since I could simply split
> A row-wise and put each piece on a node with a whole copy of B. So, as
> you say, something is going really wrong in my RowSimilarityJob run...
> maybe I should just forget RowSimilarityJob and implement a job that is
> not built around sparse matrices...
>
> Thank you all!
>
> 2010/12/20 Sebastian Schelter <[email protected]>
>
> > Hi Fernando,
> >
> > If you set maxSimilaritiesPerRow to 100, it will return only the 100
> > most similar rows for each row.
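The non-distributed A-to-B computation discussed above is small enough to sketch outside Mahout entirely. A minimal illustration in NumPy (not Mahout code; the shapes are scaled down from the thread's 50000x1000 case, and the function name is just for this sketch):

```python
import numpy as np

# Scaled-down stand-ins for the thread's matrices: the real case is
# A = 50000x100 and B = 1000x100, both dense.
rng = np.random.default_rng(0)
A = rng.standard_normal((5_000, 100))
B = rng.standard_normal((1_000, 100))

def cosine_similarities(A, B):
    """Cosine similarity of every row of A against every row of B."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T  # one matrix product does all the pairs

S = cosine_similarities(A, B)
print(S.shape)  # (5000, 1000)
# At full size, the 50000x1000 result is ~400 MB of float64 --
# still comfortably in-memory territory, as Sean argues.
```

This is why the single-machine version is a reasonable baseline: the whole A-to-B job reduces to one dense matrix product after row normalization.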
> > The density of your matrix could maybe explain the long execution
> > time, as the number of comparisons that need to be made might become
> > quadratic: every row needs to be compared with every other row, and
> > 50K times 50K is up in the billions. RowSimilarityJob's purpose is to
> > work on sparse matrices.
> >
> > Could you give us some details about your use case?
> >
> > --sebastian
> >
> > On 20.12.2010 12:58, Fernando Fernández wrote:
> >
> > > Ok, understood now :)
> > >
> > > About the parameters:
> > >
> > > It's a 50000x100 dense matrix, so I set the --numberOfColumns
> > > parameter to 100, and the rest now have the default values. (This
> > > means that maxSimilaritiesPerRow is set to 100, but I don't know
> > > which 100 it will return...)
> > >
> > > 2010/12/20 Sebastian Schelter <[email protected]>
> > >
> > > > Hi,
> > > >
> > > > Most of Mahout's algorithm implementations need to run a series
> > > > of map/reduce jobs to compute their results. By specifying a
> > > > start and end phase you can make the implementation run only some
> > > > of these internal jobs. You could, e.g., use this to restart a
> > > > failed execution.
> > > >
> > > > --sebastian
> > > >
> > > > On 20.12.2010 12:41, Fernando Fernández wrote:
> > > >
> > > > > But does this affect the result? What will I get if I launch
> > > > > RowSimilarityJob (cosine similarity) with --startPhase=1 and
> > > > > --endPhase=2? I don't fully understand what "phases" exactly
> > > > > are in this case.
> > > > >
> > > > > 2010/12/20 Niall Riddell <[email protected]>
> > > > >
> > > > > > Startphase and endphase shouldn't impact overall performance
> > > > > > in any way; however, they do mean that you can start at a
> > > > > > later stage in a job pipeline.
> > > > > >
> > > > > > You can execute specific MR jobs by designating a startphase
> > > > > > and an endphase. It goes without saying that the correct
> > > > > > inputs must be available to start a phase correctly.
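Sebastian's point about quadratic growth is worth making concrete with the numbers from the thread. A plain-Python back-of-the-envelope count (not Mahout code) contrasting the all-pairs job Fernando accidentally ran with the A-to-B job he actually wants:

```python
def pairwise_comparisons(n):
    # Distinct row pairs when every row is compared with every other
    # row: n choose 2.
    return n * (n - 1) // 2

# Stacking A (50K rows) and B (1K rows) into one C matrix and running
# row similarity compares everything with everything:
all_pairs = pairwise_comparisons(51_000)  # ~1.3 billion comparisons

# The intended A-to-B job only needs |A| * |B| comparisons:
a_to_b = 50_000 * 1_000  # 50 million -- roughly 26x fewer

print(all_pairs, a_to_b)
```

The gap only widens as A and B grow, which is why the "C = A stacked on B" experiment is a poor proxy for the eventual distributed A-to-B job.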
> > > > > > The first MR job is index 0, so setting --startPhase 1 will
> > > > > > execute the 2nd job onwards, and --endPhase 2 would stop
> > > > > > after the 3rd job.
> > > > > >
> > > > > > On 20 Dec 2010 11:17, "Fernando Fernández" <[email protected]> wrote:
> > > > > >
> > > > > > > Hello everyone,
> > > > > > >
> > > > > > > Can anyone explain what exactly these two parameters
> > > > > > > (startPhase and endPhase) are and how to use them? I'm
> > > > > > > trying to launch a RowSimilarityJob on a 50K-row matrix
> > > > > > > (100 columns) with cosine similarity and the default
> > > > > > > startPhase and endPhase parameters, and I'm getting
> > > > > > > extremely poor performance on a quite big cluster (after
> > > > > > > 16 hours it has only reached 3% of the process). I think
> > > > > > > this could have something to do with the startPhase and
> > > > > > > endPhase parameters. What do you think? How do these
> > > > > > > parameters affect the RowSimilarityJob?
> > > > > > >
> > > > > > > Thanks in advance.
> > > > > > > Fernando.
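Niall's description of phase selection amounts to slicing an ordered list of MR jobs by 0-based index. A minimal sketch of those semantics in plain Python (the `run_phases` function and the job names are hypothetical illustrations, not Mahout's actual driver code):

```python
def run_phases(jobs, start_phase=0, end_phase=None):
    """Run only jobs[start_phase..end_phase], inclusive, 0-indexed."""
    if end_phase is None:
        end_phase = len(jobs) - 1
    executed = []
    for index, job in enumerate(jobs):
        if start_phase <= index <= end_phase:
            executed.append(job)  # a real driver would launch the MR job here
    return executed

# Hypothetical three-job pipeline:
pipeline = ["weights", "pairwise-similarity", "as-matrix"]

print(run_phases(pipeline, start_phase=1, end_phase=2))
# ['pairwise-similarity', 'as-matrix'] -- skips job 0, runs jobs 1 and 2
```

So with the defaults every phase runs, which means start/end phase settings cannot explain the 16-hour run; the dense 50K-row input, as Sebastian notes, is the likelier culprit.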
