On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:

> 
> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> 
>> Where is the NaN coming up -- what has this value?
> 
> simColumn seems to be the originator in the Aggregate step.  For instance, my 
> current breakpoint shows:
> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> 
> I can also see some in the PartialMultiplyMapper via the 
> similarityMatrixColumn.  
> 
> Is that set by SimilarityMatrixRowWrapperMapper?
> <code>
>   /* remove self similarity */
>   similarityMatrixRow.set(key.get(), Double.NaN);
> </code>

Ah, but that one is deliberate (it's just removing the item's similarity to 
itself), so maybe that's not the issue.
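
To convince myself of how this spreads, here's a tiny standalone sketch (not 
the actual Mahout aggregation code, just an illustration with values modeled 
loosely on the simColumn above) of why a single NaN similarity poisons the 
whole weighted-sum estimate unless it is explicitly skipped:

<code>
// Illustration only -- not the Mahout code path.
// One NaN similarity makes the whole weighted-sum estimate NaN unless skipped.
public class NaNPropagationSketch {

  // Loosely modeled on the simColumn above; the last entry is the NaN one.
  static final double[] SIMS  = {0.9566912651062012, 0.9566912651062012, Double.NaN};
  static final double[] PREFS = {1.0, 1.0, 1.0};

  static double estimate(boolean skipNaN) {
    double numerator = 0.0;
    double denominator = 0.0;
    for (int i = 0; i < SIMS.length; i++) {
      if (skipNaN && Double.isNaN(SIMS[i])) {
        continue;  // drop the NaN entry instead of folding it into the sums
      }
      numerator += SIMS[i] * PREFS[i];
      denominator += SIMS[i];
    }
    return numerator / denominator;
  }

  public static void main(String[] args) {
    System.out.println(estimate(false));  // NaN: one bad entry ruins it
    System.out.println(estimate(true));   // 1.0: fine once the NaN is skipped
  }
}
</code>

So if a NaN similarity ever reaches the aggregation unfiltered, every estimate 
that touches that item goes NaN, which matches what I'm seeing downstream.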

> 
> 
> 
>> It should be propagated in some cases but not others. I'm not aware of
>> any changes here.
> 
> yeah, me neither.  This is all related to MAHOUT-798.
> 
>> 
>> Generally small data sets will have this problem of not being able to
>> compute much of anything useful, so NaN might be right here.
>> But you say it was different recently, which seems to rule that out.
> 
> I also _believe_ I'm seeing it in a much larger data set on Hadoop; it's
> just a whole lot harder to debug there.
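
One more thought, hedged because I haven't double-checked which similarity 
measure is actually in play here: since every preference in my input is 1.0, 
anything Pearson-like has zero variance to work with, and 0/0 comes out as 
NaN.  A minimal sketch of that failure mode (again, illustration only, not 
Mahout code):

<code>
// Illustration only: a Pearson-style similarity over co-occurring preferences.
// With all-1.0 preferences both vectors are constant, the variance terms are
// zero, and the result is 0.0 / 0.0 = NaN.
public class ConstantPrefsNaNSketch {

  static double pearson(double[] x, double[] y) {
    int n = x.length;
    double sumX = 0, sumY = 0, sumXY = 0, sumXX = 0, sumYY = 0;
    for (int i = 0; i < n; i++) {
      sumX += x[i];
      sumY += y[i];
      sumXY += x[i] * y[i];
      sumXX += x[i] * x[i];
      sumYY += y[i] * y[i];
    }
    double numerator = sumXY - sumX * sumY / n;
    double denominator =
        Math.sqrt((sumXX - sumX * sumX / n) * (sumYY - sumY * sumY / n));
    return numerator / denominator;  // 0.0 / 0.0 => NaN for constant vectors
  }

  public static void main(String[] args) {
    double[] a = {1.0, 1.0, 1.0};  // "participated in thread" flags, always 1
    double[] b = {1.0, 1.0, 1.0};
    System.out.println(pearson(a, b));  // prints NaN
  }
}
</code>

Whether that is actually what's happening upstream here, I can't say yet.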
> 
>> 
>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not 
>>> getting any recommendations due to NaNs being calculated in the 
>>> AggregateAndRecommend step.  I'm not quite sure what is going on, as this 
>>> seems to have been working as recently as two weeks ago (after Sebastian's 
>>> big change to RecommenderJob), but I don't see a whole lot of changes in 
>>> that part of the code.
>>> 
>>> The data is user ids mapping to email thread ids.  My input data is simply 
>>> a triple of user id, thread id, 1 (meaning that user participated in that 
>>> thread).  It seems like I have a lot of good values in the inputs to the 
>>> AggregateAndRecommend step, except that one id's value will be NaN, and it 
>>> then gets added in and makes everything NaN (I realize this is a very naive 
>>> understanding).  I sense that I should be looking upstream in the process 
>>> for a fix, but I am not sure where that is.
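
(For concreteness, the input is one triple per line.  These lines use made-up 
user ids against a few of the thread ids from the dump above, just to show 
the shape of the data:)

<code>
131313,309682,1
131313,42938,1
424242,309682,1
424242,309672,1
</code>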
>>> 
>>> Any ideas where I should be looking to eliminate these NaNs?  If you want 
>>> to try this with a small data set, you can get it here: 
>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout 
>>> (but note that the companion article is not published yet).
>>> 
>>> Thanks,
>>> Grant
> 
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
