Digging some more:

In AggregateAndRecommend, around line 143, I have, for userId 0, a simColumn 
of:
{22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}

That column then becomes both the numerators and the denoms.

Looping, my next simCol is:
{22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}

and then
{22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}

...

Each time, those get added into the numerators/denoms values, so that 
by the time we are done looping (line 161), we have:
numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}

numberOfSimilarItemsUsed: {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}

Not sure how to interpret this yet, as I haven't dug into the math here or 
figured out where those NaNs are coming from originally.
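
A minimal sketch of the propagation described above, using plain Java maps 
rather than Mahout's vectors. This is not the AggregateAndRecommend code: the 
class and method names are hypothetical, and only the three simColumns shown 
above are fed in (the rest are elided). It illustrates that once an item has 
been NaN in any one column, its running total stays NaN, and what a NaN guard 
(an assumption, not a confirmed fix) would change:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class NanPropagationSketch {

    /** Naive accumulation: NaN + x is NaN, so one NaN poisons that item's total. */
    public static void addAll(Map<Integer, Double> totals, Map<Integer, Double> column) {
        for (Map.Entry<Integer, Double> e : column.entrySet()) {
            totals.merge(e.getKey(), e.getValue(), Double::sum);
        }
    }

    /** Same accumulation, but NaN entries are skipped (hypothetical guard). */
    public static void addAllSkippingNaN(Map<Integer, Double> totals, Map<Integer, Double> column) {
        for (Map.Entry<Integer, Double> e : column.entrySet()) {
            if (!Double.isNaN(e.getValue())) {
                totals.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
    }

    /** Builds a column like the ones above: every item similar except one NaN. */
    public static Map<Integer, Double> column(int nanItem) {
        int[] items = {22966, 81901, 263375, 263374, 263376};
        Map<Integer, Double> col = new LinkedHashMap<>();
        for (int item : items) {
            col.put(item, item == nanItem ? Double.NaN : 0.9566912651062012);
        }
        return col;
    }

    public static void main(String[] args) {
        Map<Integer, Double> naive = new LinkedHashMap<>();
        Map<Integer, Double> guarded = new LinkedHashMap<>();
        // The three columns shown above have NaN at 263376, 263375, 263374.
        for (int nanItem : new int[] {263376, 263375, 263374}) {
            addAll(naive, column(nanItem));
            addAllSkippingNaN(guarded, column(nanItem));
        }
        System.out.println("naive:   " + naive);
        System.out.println("guarded: " + guarded);
    }
}
```

With only these three columns, the naive totals are NaN for 263376, 263375 and 
263374 but still finite for 22966 and 81901. For every item to end up NaN, as 
in the numerators/denoms above, each item would have to be NaN in at least one 
of the columns being summed.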

On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:

> 
> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> 
>> 
>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>> 
>>> Where is the NaN coming up -- what has this value?
>> 
>> simColumn seems to be the originator in the Aggregate step.  For instance, 
>> my current breakpoint shows:
>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>> 
>> I can also see some in the PartialMultiplyMapper via the 
>> similarityMatrixColumn.  
>> 
>> Is that set by SimilarityMatrixRowWrapperMapper?
>> <code>
>> /* remove self similarity */
>>   similarityMatrixRow.set(key.get(), Double.NaN);
>> </code>
> 
> Ah, but that is just taking care of itself, so maybe not the issue.
> 
>> 
>> 
>> 
>>> It should be propagated in some cases but not others. I'm not aware of
>>> any changes here.
>> 
>> yeah, me neither.  This is all related to MAHOUT-798.
>> 
>>> 
>>> Generally small data sets will have this problem of not being able to
>>> compute much of anything useful, so NaN might be right here.
>>> But you say it was different recently, which seems to rule that out.
>> 
>> I also _believe_ I'm seeing it in a much larger data set on Hadoop; it's 
>> just that that's a whole lot harder to debug.
>> 
>>> 
>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gsing...@apache.org> 
>>> wrote:
>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not 
>>>> getting any recommendations due to NaNs being calculated in the 
>>>> AggregateAndRecommend step.  I'm not quite sure what is going on as it 
>>>> seems like this was working as little as two weeks ago (post Sebastian's 
>>>> big change to RecJob), but I don't see a whole lot of changes in that part 
>>>> of the code.
>>>> 
>>>> The data is user ids mapping to email thread ids.  My input data is 
>>>> simply a triple of user id, thread id, 1 (meaning that user participated 
>>>> in that thread).  It seems like I will have a lot of good values in the 
>>>> inputs to the AggregateAndRecommend step, except one id will be NaN and 
>>>> this then seems to get added in and makes everything NaN (I realize this 
>>>> is a very naive understanding).  I sense that I should be looking upstream 
>>>> in the process for a fix, but I am not sure where that is.
>>>> 
>>>> Any ideas where I should be looking to eliminate these NaNs?  If you want 
>>>> to try this with a small data set, you can get it here: 
>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout 
>>>> (but note the companion article is not published yet.)
>>>> 
>>>> Thanks,
>>>> Grant
>> 
>> 
> 
> --------------------------------------------
> Grant Ingersoll
> http://www.lucidimagination.com
> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> 

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
