Is the Apache public download bandwidth donated by Amazon? Or should we try to keep usage within AWS?
On Thu, Oct 13, 2011 at 3:47 AM, Grant Ingersoll <[email protected]> wrote:

> On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
>
> > Grant,
> >
> > Can you share a little more details about the results, do you get any
> > exceptions? Or do you just get no results?
>
> No results.
>
> > Using the NaNs inside the similarity matrix vectors has been included in
> > the job for a very long time and should not cause any problems. As Sean
> > already mentioned we have unit tests with toy data that should catch the
> > very obvious errors in this code.
>
> Yeah, I don't know what happened. I know I was getting results as little
> as two weeks ago. I will try rolling back to an earlier commit.
>
> > Can you share the dataset? I can do a testrun on my research cluster.
>
> I already have earlier in this thread. There is a small set via the link
> below or you can use the ASF email public dataset on Amazon or any subset
> of it.
>
> > --sebastian
> >
> > On 13.10.2011 08:37, Sean Owen wrote:
> >> RecommenderJob? The unit tests run it all the time.
> >> There should not be any glitches with static variables -- don't think
> >> there are any.
> >>
> >> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <[email protected]> wrote:
> >>> Is this job working well for anyone now?
> >>> When was the last time this job worked for someone?
> >>>
> >>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <[email protected]> wrote:
> >>>
> >>>> Both local and on EC2
> >>>>
> >>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
> >>>>
> >>>>> Hi Grant,
> >>>>>
> >>>>> Just curious, are you running this locally or distributed?
> >>>>>
> >>>>> I'd run into a similar issue, though in a completely different
> >>>>> algorithm (Jimmy Lin's PageRank implementation) due to the use of a
> >>>>> static variable.
> >>>>>
> >>>>> When running locally, this wasn't getting cleared between loops, and
> >>>>> thus I got wonky results.
> >>>>>
> >>>>> The same thing would have happened with JVM reuse enabled.
> >>>>>
> >>>>> -- Ken
> >>>>>
> >>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> >>>>>
> >>>>>> Digging some more:
> >>>>>>
> >>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a
> >>>>>> simColumn of:
> >>>>>>
> >>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
> >>>>>>
> >>>>>> Which then becomes the numerator and the denom.
> >>>>>>
> >>>>>> Looping, my next simCol is:
> >>>>>>
> >>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
> >>>>>>
> >>>>>> and then
> >>>>>>
> >>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
> >>>>>>
> >>>>>> ...
> >>>>>>
> >>>>>> Each time, those are getting added into the numerators/denoms value,
> >>>>>> such that by the time we are done looping (line 161), we have:
> >>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>>>>>
> >>>>>> numberOfSimilarItemsUsed:
> >>>>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
> >>>>>>
> >>>>>> Not sure on how to interpret this as I haven't dug into the math
> >>>>>> here yet or figured out where those NaN are coming from originally.
> >>>>>>
> >>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
> >>>>>>
> >>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> >>>>>>>
> >>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> >>>>>>>>
> >>>>>>>>> Where is the NaN coming up -- what has this value?
> >>>>>>>>
> >>>>>>>> simColumn seems to be the originator in the Aggregate step. For
> >>>>>>>> instance, my current breakpoint shows:
> >>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> >>>>>>>>
> >>>>>>>> I can also see some in the PartialMultiplyMapper via the
> >>>>>>>> similarityMatrixColumn.
> >>>>>>>>
> >>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
> >>>>>>>> <code>
> >>>>>>>> /* remove self similarity */
> >>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
> >>>>>>>> </code>
> >>>>>>>
> >>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
> >>>>>>>
> >>>>>>>>> It should be propagated in some cases but not others. I'm not
> >>>>>>>>> aware of any changes here.
> >>>>>>>>
> >>>>>>>> yeah, me neither. This is all related to MAHOUT-798.
> >>>>>>>>
> >>>>>>>>> Generally small data sets will have this problem of not being
> >>>>>>>>> able to compute much of anything useful, so NaN might be right
> >>>>>>>>> here. But you say it was different recently, which seems to rule
> >>>>>>>>> that out.
> >>>>>>>>
> >>>>>>>> I also _believe_ I'm seeing it in a much larger data set on
> >>>>>>>> Hadoop, it's just that's a whole lot harder to debug.
> >>>>>>>>
> >>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <[email protected]> wrote:
> >>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and
> >>>>>>>>>> am not getting any recommendations due to NaNs being calculated
> >>>>>>>>>> in the AggregateAndRecommend step. I'm not quite sure what is
> >>>>>>>>>> going on as it seems like this was working as little as two
> >>>>>>>>>> weeks ago (post Sebastian's big change to RecJob), but I don't
> >>>>>>>>>> see a whole lot of changes in that part of the code.
> >>>>>>>>>>
> >>>>>>>>>> The data is user id's mapping to email thread ids. My input
> >>>>>>>>>> data is simply a triple of user id, thread id, 1 (meaning that
> >>>>>>>>>> user participated in that thread) It seems like I will have a
> >>>>>>>>>> lot of good values in the inputs to the AggregateAndRecommend
> >>>>>>>>>> step, except one id will be NaN and this then seems to get
> >>>>>>>>>> added in and makes everything NaN (I realize this is a very
> >>>>>>>>>> naive understanding). I sense that I should be looking upstream
> >>>>>>>>>> in the process for a fix, but I am not sure where that is.
> >>>>>>>>>>
> >>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs? If
> >>>>>>>>>> you want to try this with a small data set, you can get it here:
> >>>>>>>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout
> >>>>>>>>>> (but note the companion article is not published yet.)
> >>>>>>>>>>
> >>>>>>>>>> Thanks,
> >>>>>>>>>> Grant
> >>>>>>>
> >>>>>>> --------------------------------------------
> >>>>>>> Grant Ingersoll
> >>>>>>> http://www.lucidimagination.com
> >>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>>
> >>>>>> --------------------------------------------
> >>>>>> Grant Ingersoll
> >>>>>> http://www.lucidimagination.com
> >>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>
> >>>>> --------------------------
> >>>>> Ken Krugler
> >>>>> +1 530-210-6378
> >>>>> http://bixolabs.com
> >>>>> custom big data solutions & training
> >>>>> Hadoop, Cascading, Mahout & Solr
> >>>>
> >>>> --------------------------------------------
> >>>> Grant Ingersoll
> >>>> http://www.lucidimagination.com
> >>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>
> >>> --
> >>> Lance Norskog
> >>> [email protected]
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com

--
Lance Norskog
[email protected]
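
For anyone following along, the accumulation Grant describes in the quoted thread (each simColumn carries one NaN, and after looping every entry of the numerators and denoms is NaN) can be reproduced in isolation. This is only a minimal sketch, not Mahout's code: the class and method names are made up for illustration, the item ids and the 0.5 similarity are toy values, and the NaN-skipping variant is shown purely for contrast, not as what AggregateAndRecommend does or should do.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone illustration; none of these names exist in Mahout.
public class NaNPropagationSketch {

    /** Builds a toy similarity column with NaN at nanItemId and a fixed similarity elsewhere. */
    static Map<Long, Double> column(long[] itemIds, long nanItemId, double similarity) {
        Map<Long, Double> col = new LinkedHashMap<>();
        for (long id : itemIds) {
            col.put(id, id == nanItemId ? Double.NaN : similarity);
        }
        return col;
    }

    /** Adds the columns entry by entry; x + NaN is NaN, so one NaN per key poisons that key. */
    static Map<Long, Double> naiveSum(List<Map<Long, Double>> columns) {
        Map<Long, Double> sums = new LinkedHashMap<>();
        for (Map<Long, Double> col : columns) {
            for (Map.Entry<Long, Double> e : col.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return sums;
    }

    /** Same accumulation, but NaN entries are skipped (an assumed guard, for contrast only). */
    static Map<Long, Double> sumSkippingNaN(List<Map<Long, Double>> columns) {
        Map<Long, Double> sums = new LinkedHashMap<>();
        for (Map<Long, Double> col : columns) {
            for (Map.Entry<Long, Double> e : col.entrySet()) {
                if (!Double.isNaN(e.getValue())) {
                    sums.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        long[] ids = {1L, 2L, 3L};
        List<Map<Long, Double>> columns = new ArrayList<>();
        // One column per item, each with NaN in its own slot, as in the dumps in the thread.
        for (long id : ids) {
            columns.add(column(ids, id, 0.5));
        }
        System.out.println("naive:       " + naiveSum(columns));
        System.out.println("skipping NaN: " + sumSkippingNaN(columns));
    }
}
```

When every item appears as the NaN slot of exactly one column, the naive sum ends up NaN for every key, which matches the shape of the numerators/denoms dump in the thread. Whether that is actually the cause of the missing recommendations is not established here.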
