On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:

>
> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>
>> Where is the NaN coming up -- what has this value?
>
> simColumn seems to be the originator in the Aggregate step. For instance, my
> current breakpoint shows:
> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>
> I can also see some in the PartialMultiplyMapper via the
> similarityMatrixColumn.
>
> Is that set by SimilarityMatrixRowWrapperMapper?
> <code>
> /* remove self similarity */
> similarityMatrixRow.set(key.get(), Double.NaN);
> </code>

Ah, but that is just taking care of itself, so maybe not the issue.

>
>> It should be propagated in some cases but not others. I'm not aware of
>> any changes here.
>
> Yeah, me neither. This is all related to MAHOUT-798.
>
>>
>> Generally small data sets will have this problem of not being able to
>> compute much of anything useful, so NaN might be right here.
>> But you say it was different recently, which seems to rule that out.
>
> I also _believe_ I'm seeing it in a much larger data set on Hadoop; it's just
> that that's a whole lot harder to debug.
>
>>
>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>>> getting any recommendations due to NaNs being calculated in the
>>> AggregateAndRecommend step. I'm not quite sure what is going on, as it
>>> seems like this was working as little as two weeks ago (post Sebastian's
>>> big change to RecJob), but I don't see a whole lot of changes in that part
>>> of the code.
>>>
>>> The data is user ids mapping to email thread ids. My input data is simply
>>> a triple of user id, thread id, 1 (meaning that user participated in that
>>> thread). It seems like I will have a lot of good values in the inputs to
>>> the AggregateAndRecommend step, except one id will be NaN, and this then
>>> seems to get added in and makes everything NaN (I realize this is a very
>>> naive understanding). I sense that I should be looking upstream in the
>>> process for a fix, but I am not sure where that is.
>>>
>>> Any ideas where I should be looking to eliminate these NaNs? If you want
>>> to try this with a small data set, you can get it here:
>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout
>>> (but note the companion article is not published yet.)
>>>
>>> Thanks,
>>> Grant
>
>

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
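
For anyone following the thread, here is a minimal standalone sketch of the symptom described above: one NaN entry in a similarity column getting "added in" and turning the whole sum into NaN. This is not Mahout code. The class NanPropagationSketch and its methods are hypothetical names made up for illustration, and the map values are copied from the breakpoint dump quoted earlier. Skipping NaN entries is shown only as one possible way to keep a sum finite, not as what AggregateAndRecommend does or should do.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Mahout.
public class NanPropagationSketch {

    /** Adds up every value; a single NaN entry makes the result NaN. */
    public static double naiveSum(Map<Integer, Double> column) {
        double sum = 0.0;
        for (double value : column.values()) {
            sum += value; // NaN + anything is NaN under IEEE 754
        }
        return sum;
    }

    /** Adds up only the entries that are not NaN. */
    public static double sumSkippingNaN(Map<Integer, Double> column) {
        double sum = 0.0;
        for (double value : column.values()) {
            if (!Double.isNaN(value)) {
                sum += value;
            }
        }
        return sum;
    }

    /** The column values from the breakpoint dump quoted in the thread. */
    public static Map<Integer, Double> breakpointColumn() {
        Map<Integer, Double> column = new LinkedHashMap<>();
        column.put(309682, 0.9566912651062012);
        column.put(42938, 0.9566912651062012);
        column.put(309672, Double.NaN);
        return column;
    }

    public static void main(String[] args) {
        Map<Integer, Double> column = breakpointColumn();
        System.out.println(naiveSum(column));        // prints NaN
        System.out.println(sumSkippingNaN(column));  // finite sum of the two good values
    }
}
```

Note that a check like value == Double.NaN is always false, so Double.isNaN is the way to test for these entries.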