Hi Sebastian,

Actually, this is related to another message I sent a couple of days ago. What I really want to implement is an A-to-B similarity job. A is currently about 50K rows and B about 1,000 rows, but both will grow in the future (possibly A to hundreds of thousands and B to tens of thousands), so I thought a RowSimilarityJob over a matrix C (C being the rows of A and B stacked together) would give me an idea of the likely performance of this future distributed A-to-B similarity job, and some results to check whether the methodology works for my problem.

I have a non-distributed version right now that solves the "50000 to 1000" problem in about 40 minutes on a single machine, so I expect that a distributed version can solve the problem in approximately (time / # of nodes), since I could simply split A row-wise and put each piece on a node together with a whole copy of B.

So, as you say, something is going really wrong in my RowSimilarityJob run... maybe I should just forget about RowSimilarityJob and implement a job that doesn't assume sparse matrices...
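[For reference, the split-A-replicate-B scheme described above can be sketched like this. This is a minimal, non-distributed illustration with hypothetical function names; in the distributed version, each node would run `a_to_b_similarity` on its own chunk of A while holding a full copy of B.]

```python
import math

def cosine(u, v):
    """Cosine similarity between two dense vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def a_to_b_similarity(a_chunk, b):
    """Compare every row of this chunk of A against every row of B.
    Each node gets a disjoint chunk of A and a complete copy of B,
    so the chunks can be processed independently and results merged."""
    return [[cosine(a_row, b_row) for b_row in b] for a_row in a_chunk]
```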
Thank you all!!

2010/12/20 Sebastian Schelter <[email protected]>

> Hi Fernando,
>
> If you set maxSimilaritiesPerRow to 100 it will return only the 100 most
> similar rows for each row.
>
> The density of your matrix could maybe explain the long execution time, as
> the number of comparisons that need to be made might become quadratic,
> because every row needs to be compared with every other row (50K times 50K
> is up in the billions). RowSimilarityJob's purpose is to work on sparse
> matrices.
>
> Could you give us some details about your use case?
>
> --sebastian
>
> On 20.12.2010 12:58, Fernando Fernández wrote:
>
>> Ok, understood now :)
>>
>> About the parameters:
>>
>> It's a 50000x100 dense matrix, so I set the --numberOfColumns parameter
>> to 100, and the rest have the default values (this means that
>> maxSimilaritiesPerRow is set to 100, but I don't know which 100 it will
>> return...)
>>
>> 2010/12/20 Sebastian Schelter <[email protected]>
>>
>>> Hi,
>>>
>>> Most of Mahout's algorithm implementations need to run a series of
>>> map/reduce jobs to compute their results. By specifying a start and end
>>> phase you can make the implementation run only some of these internal
>>> jobs. You could e.g. use this to restart a failed execution.
>>>
>>> --sebastian
>>>
>>> On 20.12.2010 12:41, Fernando Fernández wrote:
>>>
>>>> But, does this affect the result? What will I get if I launch
>>>> RowSimilarity (cosine similarity) with --startPhase=1 and
>>>> --endPhase=2? I don't fully understand what "phases" exactly are in
>>>> this case.
>>>>
>>>> 2010/12/20 Niall Riddell <[email protected]>
>>>>
>>>>> Startphase and endphase shouldn't impact overall performance in any
>>>>> way; however, they do mean that you can start at a later stage in a
>>>>> job pipeline.
>>>>>
>>>>> You can execute specific MR jobs by designating a startphase and
>>>>> endphase.
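[Sebastian's point about the comparison count can be made concrete with quick arithmetic; the numbers below are illustrative:]

```python
rows = 50_000

# All-pairs comparisons for a fully dense matrix: every row against
# every other row, counting each unordered pair once.
dense_pairs = rows * (rows - 1) // 2
print(dense_pairs)  # 1_249_975_000 -- over a billion similarity computations

# RowSimilarityJob exploits sparsity: two rows only generate work where
# they share non-zero columns, which for sparse data prunes most of
# these pairs. A 100% dense matrix prunes nothing, so the sparse-oriented
# job degrades to the full quadratic workload.
```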
>>>>> It goes without saying that the correct inputs must be available to
>>>>> start a phase correctly.
>>>>>
>>>>> The first MR job is index 0. So setting --startPhase 1 will execute
>>>>> the 2nd job onwards. Putting in --endPhase 2 would stop after the
>>>>> 3rd job.
>>>>>
>>>>> On 20 Dec 2010 11:17, "Fernando Fernández"
>>>>> <[email protected]> wrote:
>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> Can anyone explain what exactly these two parameters (startphase
>>>>>> and endphase) are and how to use them? I'm trying to launch a
>>>>>> RowSimilarity job on a 50K row matrix (100 columns) with cosine
>>>>>> similarity and default startphase and endphase parameters, and I'm
>>>>>> getting extremely poor performance on a quite big cluster (after 16
>>>>>> hours, only 3% of the process has completed). I think this could
>>>>>> have something to do with the startphase and endphase parameters.
>>>>>> What do you think? How do these parameters affect the RowSimilarity
>>>>>> job?
>>>>>>
>>>>>> Thanks in advance.
>>>>>> Fernando.
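[Niall's description of startPhase/endPhase amounts to simple index gating over an ordered list of jobs. The sketch below uses hypothetical job names and models only the gating logic, not the actual MapReduce driver:]

```python
def run_pipeline(jobs, start_phase=0, end_phase=None):
    """Run only the jobs whose 0-based index falls inside
    [start_phase, end_phase], mirroring --startPhase/--endPhase."""
    if end_phase is None:
        end_phase = len(jobs) - 1
    executed = []
    for phase, job in enumerate(jobs):
        if start_phase <= phase <= end_phase:
            executed.append(job)  # a real driver would call job.run() here
    return executed

# Hypothetical three-job pipeline; --startPhase 1 --endPhase 2 skips
# job 0 and stops after job 2, so the inputs job 0 would have produced
# must already exist on disk.
jobs = ["weights", "pairwise-similarity", "as-matrix"]
```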
