cd mahout/examples/bin
./build-asf-email.sh content/ out/ over/
select 1 for recommender

where content/ is
  content/coccoon.apache.org
  content/commons.apache.org
and out/ and over/ are output directories.

Run the shell script with -x as you will probably have to tweak it.

Lance

On Thu, Oct 13, 2011 at 11:04 PM, Sebastian Schelter <s...@apache.org> wrote:

> Only got the raw data, how did you convert it to our standard
> recommender input?
>
> --sebastian
>
>
> On 14.10.2011 01:17, Grant Ingersoll wrote:
> > Were you able to get the data, Sebastian?
> >
> > On Oct 13, 2011, at 4:01 AM, Sebastian Schelter wrote:
> >
> >> Grant,
> >>
> >> Can you share a little more details about the results, do you get any
> >> exceptions? Or do you just get no results?
> >>
> >> Using the NaNs inside the similarity matrix vectors has been included in
> >> the job for a very long time and should not cause any problems. As Sean
> >> already mentioned we have unit tests with toy data that should catch the
> >> very obvious errors in this code.
> >>
> >> Can you share the dataset? I can do a testrun on my research cluster.
> >>
> >> --sebastian
> >>
> >> On 13.10.2011 08:37, Sean Owen wrote:
> >>> RecommenderJob? The unit tests run it all the time.
> >>> There should not be any glitches with static variables -- don't think
> >>> there are any.
> >>>
> >>> On Thu, Oct 13, 2011 at 7:33 AM, Lance Norskog <goks...@gmail.com> wrote:
> >>>> Is this job working well for anyone now?
> >>>> When was the last time this job worked for someone?
> >>>>
> >>>> On Wed, Oct 12, 2011 at 11:30 AM, Grant Ingersoll <gsing...@apache.org> wrote:
> >>>>
> >>>>> Both local and on EC2
> >>>>>
> >>>>> On Oct 12, 2011, at 2:10 PM, Ken Krugler wrote:
> >>>>>
> >>>>>> Hi Grant,
> >>>>>>
> >>>>>> Just curious, are you running this locally or distributed?
> >>>>>>
> >>>>>> I'd run into a similar issue, though in a completely different algorithm
> >>>>>> (Jimmy Lin's PageRank implementation) due to the use of a static variable.
> >>>>>>
> >>>>>> When running locally, this wasn't getting cleared between loops, and thus
> >>>>>> I got wonky results.
> >>>>>>
> >>>>>> The same thing would have happened with JVM reuse enabled.
> >>>>>>
> >>>>>> -- Ken
> >>>>>>
> >>>>>> On Oct 12, 2011, at 3:28pm, Grant Ingersoll wrote:
> >>>>>>
> >>>>>>> Digging some more:
> >>>>>>>
> >>>>>>> In AggregateAndRecommend, around lines 143, I have, for userId 0, a simColumn of:
> >>>>>>>
> >>>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:0.9566912651062012,263376:NaN}
> >>>>>>>
> >>>>>>> Which then becomes the numerator and the denom.
> >>>>>>>
> >>>>>>> Looping, my next simCol is:
> >>>>>>>
> >>>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:NaN,263374:0.9566912651062012,263376:0.9566912651062012}
> >>>>>>>
> >>>>>>> and then
> >>>>>>>
> >>>>>>> {22966:0.9566912651062012,81901:0.9566912651062012,263375:0.9566912651062012,263374:NaN,263376:0.9566912651062012}
> >>>>>>>
> >>>>>>> ...
> >>>>>>>
> >>>>>>> Each time, those are getting added into the numerators/denoms value,
> >>>>>>> such that by the time we are done looping (line 161), we have:
> >>>>>>> numerators: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>>>>>> denoms: {22966:NaN,81901:NaN,263376:NaN,263375:NaN,263374:NaN}
> >>>>>>>
> >>>>>>> numberOfSimilarItemsUsed: {81901:5.0,22966:5.0,263376:5.0,263375:5.0,263374:5.0}
> >>>>>>>
> >>>>>>> Not sure on how to interpret this as I haven't dug into the math here
> >>>>>>> yet or figured out where those NaN are coming from originally.
> >>>>>>>
> >>>>>>> On Oct 11, 2011, at 2:55 PM, Grant Ingersoll wrote:
> >>>>>>>
> >>>>>>>>
> >>>>>>>> On Oct 11, 2011, at 2:49 PM, Grant Ingersoll wrote:
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Oct 11, 2011, at 12:36 PM, Sean Owen wrote:
> >>>>>>>>>
> >>>>>>>>>> Where is the NaN coming up -- what has this value?
> >>>>>>>>>
> >>>>>>>>> simColumn seems to be the originator in the Aggregate step.
> >>>>>>>>> For instance, my current breakpoint shows:
> >>>>>>>>> {309682:0.9566912651062012,42938:0.9566912651062012,309672:NaN}
> >>>>>>>>>
> >>>>>>>>> I can also see some in the PartialMultiplyMapper via the similarityMatrixColumn.
> >>>>>>>>>
> >>>>>>>>> Is that set by SimilarityMatrixRowWrapperMapper?
> >>>>>>>>> <code>
> >>>>>>>>> /* remove self similarity */
> >>>>>>>>> similarityMatrixRow.set(key.get(), Double.NaN);
> >>>>>>>>> </code>
> >>>>>>>>
> >>>>>>>> Ah, but that is just taking care of itself, so maybe not the issue.
> >>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>>> It should be propagated in some cases but not others. I'm not aware of
> >>>>>>>>>> any changes here.
> >>>>>>>>>
> >>>>>>>>> yeah, me neither. This is all related to MAHOUT-798.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> Generally small data sets will have this problem of not being able to
> >>>>>>>>>> compute much of anything useful, so NaN might be right here.
> >>>>>>>>>> But you say it was different recently, which seems to rule that out.
> >>>>>>>>>
> >>>>>>>>> I also _believe_ I'm seeing it in a much larger data set on Hadoop,
> >>>>>>>>> it's just that's a whole lot harder to debug.
> >>>>>>>>>
> >>>>>>>>>>
> >>>>>>>>>> On Tue, Oct 11, 2011 at 5:34 PM, Grant Ingersoll <gsing...@apache.org> wrote:
> >>>>>>>>>>> I'm running trunk RecommenderJob (via build-asf-email.sh) and am not
> >>>>>>>>>>> getting any recommendations due to NaNs being calculated in the
> >>>>>>>>>>> AggregateAndRecommend step. I'm not quite sure what is going on as it seems
> >>>>>>>>>>> like this was working as little as two weeks ago (post Sebastian's big
> >>>>>>>>>>> change to RecJob), but I don't see a whole lot of changes in that part of
> >>>>>>>>>>> the code.
> >>>>>>>>>>>
> >>>>>>>>>>> The data is user id's mapping to email thread ids.
> >>>>>>>>>>> My input data is simply a triple of user id, thread id, 1 (meaning that user
> >>>>>>>>>>> participated in that thread) It seems like I will have a lot of good values in
> >>>>>>>>>>> the inputs to the AggregateAndRecommend step, except one id will be NaN and this
> >>>>>>>>>>> then seems to get added in and makes everything NaN (I realize this is a very
> >>>>>>>>>>> naive understanding). I sense that I should be looking upstream in the
> >>>>>>>>>>> process for a fix, but I am not sure where that is.
> >>>>>>>>>>>
> >>>>>>>>>>> Any ideas where I should be looking to eliminate these NaNs? If you
> >>>>>>>>>>> want to try this with a small data set, you can get it here:
> >>>>>>>>>>> http://www.lucidimagination.com/devzone/technical-articles/scaling-mahout
> >>>>>>>>>>> (but note the companion article is not published yet.)
> >>>>>>>>>>>
> >>>>>>>>>>> Thanks,
> >>>>>>>>>>> Grant
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>
> >>>>>>>> --------------------------------------------
> >>>>>>>> Grant Ingersoll
> >>>>>>>> http://www.lucidimagination.com
> >>>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>>>>
> >>>>>>>
> >>>>>>> --------------------------------------------
> >>>>>>> Grant Ingersoll
> >>>>>>> http://www.lucidimagination.com
> >>>>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>>>
> >>>>>>
> >>>>>> --------------------------
> >>>>>> Ken Krugler
> >>>>>> +1 530-210-6378
> >>>>>> http://bixolabs.com
> >>>>>> custom big data solutions & training
> >>>>>> Hadoop, Cascading, Mahout & Solr
> >>>>>>
> >>>>>>
> >>>>>>
> >>>>>
> >>>>> --------------------------------------------
> >>>>> Grant Ingersoll
> >>>>> http://www.lucidimagination.com
> >>>>> Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >>>>>
> >>>>>
> >>>>
> >>>>
> >>>> --
> >>>> Lance Norskog
> >>>> goks...@gmail.com
> >>>>
> >>
> >
> > --------------------------------------------
> > Grant Ingersoll
> > http://www.lucidimagination.com
> > Lucene Eurocon 2011: http://www.lucene-eurocon.com
> >
> >

--
Lance Norskog
goks...@gmail.com
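
For anyone following the NaN discussion quoted above, here is a minimal standalone sketch of the arithmetic Grant describes: similarity columns, each with NaN in a different slot, added into a running total. It is not Mahout code. The class name, the accumulate method, and the skipNaN guard are all hypothetical and exist only for illustration; the similarity value is copied from the simColumn dumps in the thread. It shows only the general floating-point behaviour (any sum that includes NaN is NaN) and makes no claim about what AggregateAndRecommend is meant to do with those entries.

```java
import java.util.Arrays;

public class NanAccumulationSketch {

    /**
     * Adds every column into a running total, slot by slot.
     * When skipNaN is true, NaN entries are ignored (a hypothetical guard,
     * not something taken from the Mahout source).
     */
    public static double[] accumulate(double[][] columns, boolean skipNaN) {
        double[] totals = new double[columns[0].length];
        for (double[] column : columns) {
            for (int i = 0; i < column.length; i++) {
                if (skipNaN && Double.isNaN(column[i])) {
                    continue;
                }
                totals[i] += column[i]; // NaN + anything is NaN
            }
        }
        return totals;
    }

    public static void main(String[] args) {
        double s = 0.9566912651062012; // value seen in the simColumn dumps
        double nan = Double.NaN;

        // Three columns, each with NaN in a different slot.
        double[][] columns = {
            {s, s, nan},
            {s, nan, s},
            {nan, s, s},
        };

        // Without a guard every slot meets a NaN once, so every total is NaN.
        System.out.println(Arrays.toString(accumulate(columns, false)));

        // With the guard each slot sums its two non-NaN entries.
        System.out.println(Arrays.toString(accumulate(columns, true)));
    }
}
```

The unguarded call prints [NaN, NaN, NaN], which matches the shape of the numerators/denoms dump in the thread: once each slot has seen a NaN in any one column, the whole total is NaN.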