Were you able to get the data, Sebastian?

On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
> Grant,
>
> Can you share a few more details about the results? Do you get any
> exceptions? Or do you just get no results?
>
> Using the NaNs inside the similarity matrix vectors has been included in
> the job for a very long time and should not cause any problems. As Sean
> already mentioned, we have unit tests with toy data that should catch the
> very obvious errors in this code.
>
> Can you share the dataset? I can do a test run on my research cluster.
>
> --sebastian
>
> On 13.10.2011 08:37, Sean Owen wrote:
>> RecommenderJob? The unit tests run it all the time.
>> There should not be any glitches with static variables -- don't think
>> there are any.
>>
>> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <goks...@gmail.com> wrote:
>>> Is this job working well for anyone now?
>>> When was the last time this job worked for someone?
>>>
>>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll
>>> <gsing...@apache.org> wrote:
>>>
>>>> Both local and on EC2
>>>>
>>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
>>>>
>>>>> Hi Grant,
>>>>>
>>>>> Just curious, are you running this locally or distributed?
>>>>>
>>>>> I'd run into a similar issue, though in a completely different algorithm
>>>>> (Jimmy Lin's PageRank implementation) due to the use of a static variable.
>>>>>
>>>>> When running locally, this wasn't getting cleared between loops, and thus
>>>>> I got wonky results.
>>>>>
>>>>> The same thing would have happened with JVM reuse enabled.
>>>>>
>>>>> -- Ken
>>>>>
>>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
>>>>>
>>>>>> Digging some more:
>>>>>>
>>>>>> In AggregateAndRecommend, around line 143, I have, for userId 0, a
>>>>>> simColumn of:
>>>>>>
>>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
>>>>>>
>>>>>> Which then becomes the numerator and the denom.
>>>>>>
>>>>>> Looping, my next simCol is:
>>>>>>
>>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
>>>>>>
>>>>>> and then
>>>>>>
>>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
>>>>>>
>>>>>> ...
>>>>>>
>>>>>> Each time, those are getting added into the numerators/denoms value,
>>>>>> such that by the time we are done looping (line 161), we have:
>>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
>>>>>>
>>>>>> numberOfSimilarItemsUsed:
>>>>>> {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
>>>>>>
>>>>>> Not sure on how to interpret this as I haven't dug into the math here
>>>>>> yet or figured out where those NaN are coming from originally.
>>>>>>
>>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
>>>>>>
>>>>>>>
>>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
>>>>>>>
>>>>>>>>
>>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
>>>>>>>>
>>>>>>>>> Where is the NaN coming up -- what has this value?
>>>>>>>>
>>>>>>>> simColumn seems to be the originator in the Aggregate step. For
>>>>>>>> instance, my current breakpoint shows:
>>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
>>>>>>>>
>>>>>>>> I can also see some in the PartialMultiplyMapper via the
>>>>>>>> similarityMatrixColumn.
>>>>>>>>
>>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
>>>>>>>> <code>
>>>>>>>> /* remove self similarity */
>>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
>>>>>>>> </code>
>>>>>>>
>>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
>>>>>>>
>>>>>>>>
>>>>>>>>> It should be propagated in some cases but not others. I'm not aware of
>>>>>>>>> any changes here.
>>>>>>>>
>>>>>>>> yeah, me neither. This is all related to MAHOUT-798.
>>>>>>>>
>>>>>>>>> Generally small data sets will have this problem of not being able to
>>>>>>>>> compute much of anything useful, so NaN might be right here.
>>>>>>>>> But you say it was different recently, which seems to rule that out.
>>>>>>>>
>>>>>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
>>>>>>>> it's just that's a whole lot harder to debug.
>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gsing...@apache.org> wrote:
>>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
>>>>>>>>>> getting any recommendations due to NaNs being calculated in the
>>>>>>>>>> AggregateAndRecommend step. I'm not quite sure what is going on as it
>>>>>>>>>> seems like this was working as little as two weeks ago (post Sebastian's
>>>>>>>>>> big change to RecJob), but I don't see a whole lot of changes in that
>>>>>>>>>> part of the code.
>>>>>>>>>>
>>>>>>>>>> The data is user ids mapping to email thread ids. My input data is
>>>>>>>>>> simply a triple of user id, thread id, 1 (meaning that user participated
>>>>>>>>>> in that thread). It seems like I will have a lot of good values in the
>>>>>>>>>> inputs to the AggregateAndRecommend step, except one id will be NaN and
>>>>>>>>>> this then seems to get added in and makes everything NaN (I realize this
>>>>>>>>>> is a very naive understanding). I sense that I should be looking upstream
>>>>>>>>>> in the process for a fix, but I am not sure where that is.
>>>>>>>>>>
>>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs? If you
>>>>>>>>>> want to try this with a small data set, you can get it here:
>>>>>>>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout
>>>>>>>>>> (but note the companion article is not published yet).
>>>>>>>>>>
>>>>>>>>>> Thanks,
>>>>>>>>>> Grant
>>>>>>>>
>>>>>>>
>>>>>>> --------------------------------------------
>>>>>>> Grant Ingersoll
>>>>>>> http://www.lucidimagination.com
>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>>
>>>>>>
>>>>>> --------------------------------------------
>>>>>> Grant Ingersoll
>>>>>> http://www.lucidimagination.com
>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>>>
>>>>>
>>>>> --------------------------
>>>>> Ken Krugler
>>>>> +1 530-210-6378
>>>>> http://bixolabs.com
>>>>> custom big data solutions & training
>>>>> Hadoop, Cascading, Mahout & Solr
>>>>>
>>>>
>>>> --------------------------------------------
>>>> Grant Ingersoll
>>>> http://www.lucidimagination.com
>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
>>>>
>>>
>>> --
>>> Lance Norskog
>>> goks...@gmail.com
>>>
>

--------------------------------------------
Grant Ingersoll
http://www.lucidimagination.com
Lucene Eurocon 2011: http://www.lucene-eurocon.com
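
P.S. The way one NaN per simColumn turns whole entries of the numerators/denoms into NaN can be reproduced outside of Mahout. What follows is only a minimal sketch in plain Java, not the AggregateAndRecommend code: the class NanPropagationSketch and its methods column, naiveSum and sumSkippingNaN are hypothetical names made up for illustration, and plain maps stand in for Mahout's vectors. It uses only the three simColumn values quoted in the thread.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustration only: hypothetical helper, not part of Mahout. */
public class NanPropagationSketch {

    // Item ids and similarity value taken from the simColumns quoted in the thread.
    static final long[] ITEMS = {22966L, 81901L, 263375L, 263374L, 263376L};
    static final double SIM = 0.9566912651062012;

    /** Builds a column where every item has SIM except nanItem, which is NaN. */
    public static Map<Long, Double> column(long nanItem) {
        Map<Long, Double> col = new LinkedHashMap<>();
        for (long item : ITEMS) {
            col.put(item, item == nanItem ? Double.NaN : SIM);
        }
        return col;
    }

    /** Element-wise sum; a NaN in any column makes that item's total NaN. */
    public static Map<Long, Double> naiveSum(List<Map<Long, Double>> columns) {
        Map<Long, Double> sums = new LinkedHashMap<>();
        for (Map<Long, Double> col : columns) {
            for (Map.Entry<Long, Double> e : col.entrySet()) {
                sums.merge(e.getKey(), e.getValue(), Double::sum);
            }
        }
        return sums;
    }

    /** Element-wise sum that ignores NaN entries instead of adding them. */
    public static Map<Long, Double> sumSkippingNaN(List<Map<Long, Double>> columns) {
        Map<Long, Double> sums = new LinkedHashMap<>();
        for (Map<Long, Double> col : columns) {
            for (Map.Entry<Long, Double> e : col.entrySet()) {
                if (!Double.isNaN(e.getValue())) {
                    sums.merge(e.getKey(), e.getValue(), Double::sum);
                }
            }
        }
        return sums;
    }

    public static void main(String[] args) {
        // The three columns shown in the thread: NaN at 263376, 263375, 263374.
        List<Map<Long, Double>> columns =
            Arrays.asList(column(263376L), column(263375L), column(263374L));
        System.out.println("naive:    " + naiveSum(columns));
        System.out.println("skip NaN: " + sumSkippingNaN(columns));
    }
}
```

With just these three columns, the naive sum is NaN for 263376, 263375 and 263374 and finite for the other two items, since any sum involving NaN is NaN. Once every item has been the NaN slot in at least one column, every total is NaN, which matches the numerators/denoms dump above. Whether skipping NaN is the right fix for RecommenderJob is not something this thread settles; the sketch only shows the arithmetic.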