I don't think that's what the job does. It is computing the similarity of every row of A with every other row of A, which is not what you are trying to do.
Even at tens of thousands of rows, that's not large at all and comfortably fits in memory. I would just continue with your non-distributed version. I think one fact that's overlooked is that distributing a computation typically introduces a load of overhead -- some constant scalar factor, and not a small one. It takes a lot of work to move all that data around. Distributing is a necessary evil, and I believe it should be avoided if you can avoid it.

2010/12/20 Fernando Fernández <[email protected]>

> Hi Sebastian,
>
> Actually, this is related to another message I sent a couple of days ago.
> What I really want to implement is an A-to-B similarity job. A is at the
> moment about 50K rows and B 1000 rows, but this will grow in the future
> (possibly A to hundreds of thousands and B to tens of thousands), so I
> thought a RowSimilarityJob over a C matrix (C being the rows of A and B
> put together) would give me an idea of the likely performance of this
> future distributed A-to-B similarity job, and some results to check
> whether the methodology works for my problem. I have a non-distributed
> version right now that solves the "50000 to 1000" problem in about 40
> minutes on a single machine, so I expect that a distributed version can
> solve it in approximately (time / # of nodes), since I could simply split
> A row-wise and put each piece on a node with a whole copy of B. So, as
> you say, something is going really wrong in my RowSimilarityJob run...
> maybe I should just forget RowSimilarityJob and implement a job that is
> not built around sparse matrices...
>
> Thank you all!
>
> 2010/12/20 Sebastian Schelter <[email protected]>
>
> > Hi Fernando,
> >
> > If you set maxSimilaritiesPerRow to 100, it will return only the 100
> > most similar rows for each row.
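The non-distributed A-to-B computation discussed above is small enough to sketch outside Mahout entirely. A minimal illustration in NumPy (not Mahout code; the shapes are scaled down from the thread's 50000x1000 case, and the function name is just for this sketch):

```python
import numpy as np

# Scaled-down stand-ins for the thread's matrices: the real case is
# A = 50000x100 and B = 1000x100, both dense.
rng = np.random.default_rng(0)
A = rng.standard_normal((5_000, 100))
B = rng.standard_normal((1_000, 100))

def cosine_similarities(A, B):
    """Cosine similarity of every row of A against every row of B."""
    A_unit = A / np.linalg.norm(A, axis=1, keepdims=True)
    B_unit = B / np.linalg.norm(B, axis=1, keepdims=True)
    return A_unit @ B_unit.T  # one matrix product does all the pairs

S = cosine_similarities(A, B)
print(S.shape)  # (5000, 1000)
# At full size, the 50000x1000 result is ~400 MB of float64 --
# still comfortably in-memory territory, as Sean argues.
```

This is why the single-machine version is a reasonable baseline: the whole A-to-B job reduces to one dense matrix product after row normalization.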
> > The density of your matrix could maybe explain the long execution
> > time, as the number of comparisons that need to be made might become
> > quadratic: every row needs to be compared with every other row, and
> > 50K times 50K is up in the billions. RowSimilarityJob's purpose is to
> > work on sparse matrices.
> >
> > Could you give us some details about your use case?
> >
> > --sebastian
> >
> > On 20.12.2010 12:58, Fernando Fernández wrote:
> >
> > > Ok, understood now :)
> > >
> > > About the parameters:
> > >
> > > It's a 50000x100 dense matrix, so I set the --numberOfColumns
> > > parameter to 100, and the rest now have the default values. (This
> > > means that maxSimilaritiesPerRow is set to 100, but I don't know
> > > which 100 it will return...)
> > >
> > > 2010/12/20 Sebastian Schelter <[email protected]>
> > >
> > > > Hi,
> > > >
> > > > Most of Mahout's algorithm implementations need to run a series
> > > > of map/reduce jobs to compute their results. By specifying a
> > > > start and end phase you can make the implementation run only some
> > > > of these internal jobs. You could, e.g., use this to restart a
> > > > failed execution.
> > > >
> > > > --sebastian
> > > >
> > > > On 20.12.2010 12:41, Fernando Fernández wrote:
> > > >
> > > > > But does this affect the result? What will I get if I launch
> > > > > RowSimilarityJob (cosine similarity) with --startPhase=1 and
> > > > > --endPhase=2? I don't fully understand what "phases" exactly
> > > > > are in this case.
> > > > >
> > > > > 2010/12/20 Niall Riddell <[email protected]>
> > > > >
> > > > > > Startphase and endphase shouldn't impact overall performance
> > > > > > in any way; however, they do mean that you can start at a
> > > > > > later stage in a job pipeline.
> > > > > >
> > > > > > You can execute specific MR jobs by designating a startphase
> > > > > > and an endphase. It goes without saying that the correct
> > > > > > inputs must be available to start a phase correctly.
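Sebastian's point about quadratic growth is worth making concrete with the numbers from the thread. A plain-Python back-of-the-envelope count (not Mahout code) contrasting the all-pairs job Fernando accidentally ran with the A-to-B job he actually wants:

```python
def pairwise_comparisons(n):
    # Distinct row pairs when every row is compared with every other
    # row: n choose 2.
    return n * (n - 1) // 2

# Stacking A (50K rows) and B (1K rows) into one C matrix and running
# row similarity compares everything with everything:
all_pairs = pairwise_comparisons(51_000)  # ~1.3 billion comparisons

# The intended A-to-B job only needs |A| * |B| comparisons:
a_to_b = 50_000 * 1_000  # 50 million -- roughly 26x fewer

print(all_pairs, a_to_b)
```

The gap only widens as A and B grow, which is why the "C = A stacked on B" experiment is a poor proxy for the eventual distributed A-to-B job.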
> > > > > > The first MR job is index 0, so setting --startPhase 1 will
> > > > > > execute the 2nd job onwards, and --endPhase 2 would stop
> > > > > > after the 3rd job.
> > > > > >
> > > > > > On 20 Dec 2010 11:17, "Fernando Fernández" <[email protected]> wrote:
> > > > > >
> > > > > > > Hello everyone,
> > > > > > >
> > > > > > > Can anyone explain what exactly these two parameters
> > > > > > > (startPhase and endPhase) are and how to use them? I'm
> > > > > > > trying to launch a RowSimilarityJob on a 50K-row matrix
> > > > > > > (100 columns) with cosine similarity and the default
> > > > > > > startPhase and endPhase parameters, and I'm getting
> > > > > > > extremely poor performance on a quite big cluster (after
> > > > > > > 16 hours it has only reached 3% of the process). I think
> > > > > > > this could have something to do with the startPhase and
> > > > > > > endPhase parameters. What do you think? How do these
> > > > > > > parameters affect the RowSimilarityJob?
> > > > > > >
> > > > > > > Thanks in advance.
> > > > > > > Fernando.
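Niall's description of phase selection amounts to slicing an ordered list of MR jobs by 0-based index. A minimal sketch of those semantics in plain Python (the `run_phases` function and the job names are hypothetical illustrations, not Mahout's actual driver code):

```python
def run_phases(jobs, start_phase=0, end_phase=None):
    """Run only jobs[start_phase..end_phase], inclusive, 0-indexed."""
    if end_phase is None:
        end_phase = len(jobs) - 1
    executed = []
    for index, job in enumerate(jobs):
        if start_phase <= index <= end_phase:
            executed.append(job)  # a real driver would launch the MR job here
    return executed

# Hypothetical three-job pipeline:
pipeline = ["weights", "pairwise-similarity", "as-matrix"]

print(run_phases(pipeline, start_phase=1, end_phase=2))
# ['pairwise-similarity', 'as-matrix'] -- skips job 0, runs jobs 1 and 2
```

So with the defaults every phase runs, which means start/end phase settings cannot explain the 16-hour run; the dense 50K-row input, as Sebastian notes, is the likelier culprit.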
