Digging some more: in AggregateAndRecommend, around line 143, I have, for userId 0, a simColumn of:
{22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
Which then becomes the numerator and the denom. Looping, my next simCol is:
{22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
and then
{22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
...

Each time, those are getting added into the numerators/denoms value, such that by the time we are done looping (line 161), we have:
numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
numberOfSimilarItemsUsed: {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}

Not sure how to interpret this, as I haven't dug into the math here yet or figured out where those NaNs are coming from originally.

On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:

>
> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>
>>
>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>
>>> Where is the NaN coming up -- what has this value?
>>
>> simColumn seems to be the originator in the Aggregate step. For instance,
>> my current breakpoint shows:
>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>
>> I can also see some in the PartialMultiplyMapper via the
>> similarityMatrixColumn.
>>
>> Is that set by SimilarityMatrixRowWrapperMapper?
>> <code>
>> /* remove self similarity */
>> similarityMatrixRow.set(key.get(), Double.NaN);
>> </code>
>
> Ah, but that is just taking care of itself, so maybe not the issue.
>
>>
>>
>>
>>> It should be propagated in some cases but not others. I'm not aware of
>>> any changes here.
>>
>> yeah, me neither. This is all related to MAHOUT-798.
>>
>>>
>>> Generally small data sets will have this problem of not being able to
>>> compute much of anything useful, so NaN might be right here.
>>> But you say it was different recently, which seems to rule that out.
>>
>> I also _believe_ I'm seeing it in a much larger data set on Hadoop, it's
>> just that's a whole lot harder to debug.
>>
>>>
>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gsing...@apache.org>
>>> wrote:
>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>>>> getting any recommendations due to NaNs being calculated in the
>>>> AggregateAndRecommend step. I'm not quite sure what is going on as it
>>>> seems like this was working as little as two weeks ago (post Sebastian's
>>>> big change to RecJob), but I don't see a whole lot of changes in that part
>>>> of the code.
>>>>
>>>> The data is user id's mapping to email thread ids. My input data is
>>>> simply a triple of user id, thread id, 1 (meaning that user participated
>>>> in that thread) It seems like I will have a lot of good values in the
>>>> inputs to the AggregateAndRecommend step, except one id will be NaN and
>>>> this then seems to get added in and makes everything NaN (I realize this
>>>> is a very naive understanding). I sense that I should be looking upstream
>>>> in the process for a fix, but I am not sure where that is.
>>>>
>>>> Any ideas where I should be looking to eliminate these NaNs? If you want
>>>> to try this with a small data set, you can get it here:
>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout
>>>> (but note the companion article is not published yet.)
>>>>
>>>> Thanks,
>>>> Grant
>>
>>
>
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
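
P.S. To make the accumulation at the top of this message concrete, here is a minimal standalone sketch of what I think is happening. It is not the Mahout code: plain java.util maps stand in for the vectors, and the class and method names (NanPropagationSketch, column, sumAll, sumSkippingNaN) are made up for illustration. It uses only the three simColumns quoted above, so only the three slots that hold a NaN in one of those columns turn NaN here; the remaining columns are elided above and are not reconstructed.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: plain maps stand in for Mahout vectors.
public class NanPropagationSketch {

    static final long[] ITEM_IDS = {22966L, 81901L, 263375L, 263374L, 263376L};
    static final double SIM = 0.9566912651062012;

    /** Builds one similarity column where exactly one item holds NaN. */
    static Map<Long, Double> column(long nanItem) {
        Map<Long, Double> col = new LinkedHashMap<>();
        for (long id : ITEM_IDS) {
            col.put(id, id == nanItem ? Double.NaN : SIM);
        }
        return col;
    }

    /** Element-wise sum of all columns; x + NaN is NaN, so one NaN poisons its slot. */
    static Map<Long, Double> sumAll(List<Map<Long, Double>> columns) {
        Map<Long, Double> acc = new LinkedHashMap<>();
        for (Map<Long, Double> col : columns) {
            for (Map.Entry<Long, Double> e : col.entrySet()) {
                acc.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return acc;
    }

    /** Same sum, but NaN entries are skipped instead of added (one possible workaround). */
    static Map<Long, Double> sumSkippingNaN(List<Map<Long, Double>> columns) {
        Map<Long, Double> acc = new LinkedHashMap<>();
        for (Map<Long, Double> col : columns) {
            for (Map.Entry<Long, Double> e : col.entrySet()) {
                if (!Double.isNaN(e.getValue())) {
                    acc.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
        return acc;
    }

    public static void main(String[] args) {
        // The three simColumns quoted in this message, one NaN slot each.
        List<Map<Long, Double>> columns =
            Arrays.asList(column(263376L), column(263375L), column(263374L));
        System.out.println("naive sum:    " + sumAll(columns));
        System.out.println("skipping NaN: " + sumSkippingNaN(columns));
    }
}
```

Whether skipping the NaN entries is actually the right fix, or whether they should never reach the aggregation in the first place, is exactly what I have not figured out yet.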